1 Conventions

Just a few notes about the present tutorial. Most text will be formatted as this paragraph is, but at times there will be code blocks together with their output (output lines are prefixed by ##):

a <- 1
(a <- 1)
## [1] 1

Finally, as this is a tutorial, you will be subjected to exercises. You should only proceed after answering the question asked. To answer the question, go to _http://socrative.com_, select “Student Login” and use the room number you were provided with. You will be taken directly to the active quiz, which you can answer at your own pace; however, you can answer a question only once. Once you have submitted your answer, you can press “Close” in the present tutorial; the current section of the tutorial will collapse (you can always re-open it, e.g. if you need clarification about the question) and the next section will open. If anything is unclear at any time, do not hesitate to contact us on Slack.

2 Introduction

2.1 About the tutorial

This tutorial addresses several aspects of an RNA-Seq data analysis. First, it deals with data pre-processing (a process rather agnostic to the type of NGS analysis, i.e. it would also be valid for ChIP-Seq or DNA-Seq). That part of the tutorial introduces the use of the command line (specifically the Bourne-again shell: bash) and several software tools to prepare sequencing data for downstream analysis. Specifically, we will be covering quality control, rRNA sorting, quality trimming, genome mapping and feature summarization. If these terms don’t mean much yet, hopefully they will by the end of today. Next, it deals with the manipulation of alignment files (in BAM format) and with retrieving and manipulating genomic (and genic) annotations, subsequently combining the two to obtain a raw count table1. This count table is the minimal necessary input for a Differential Expression analysis, which is the ultimate step in this tutorial.

The goals of the tutorial are:

  • First part
    • become familiar with the Linux command line
    • be able to carry out quality control and necessary pre-processing steps of sequencing data
  • Second part
    • become familiar with the R environment
    • develop familiarity with R/Bioconductor packages for high-throughput analysis
  • Third part
    • become familiar with the Differential Expression type of analysis

2.2 Main software of interest

  1. FastQC - Sequence data quality control

  2. SortMeRNA - rRNA filtering

  3. Trimmomatic - Low quality read and adapter trimming

  4. STAR - Read mapping

  5. HTSeq and Kallisto - Feature summarization

2.3 Main R/Bioconductor packages of interest

Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. Among these, we will focus principally on three: GenomicAlignments, GenomicFeatures and GenomicRanges. We will look at these packages (i) as an example to understand what the process of feature summarization is (i.e. taking a look under the hood) and (ii) as an introduction to computational optimisation in R.
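As a first flavour of what these packages offer, here is a minimal GenomicRanges sketch; the sequence name and coordinates are invented purely for illustration:

```r
## a minimal GRanges sketch; the coordinates are made up for illustration
suppressPackageStartupMessages(library(GenomicRanges))

gr <- GRanges(seqnames="Chr1",
              ranges=IRanges(start=c(1, 100), end=c(50, 200)),
              strand=c("+", "-"))
width(gr)    # widths are computed as end - start + 1, i.e. 50 and 101
```

Such range objects are the common currency of the three packages: alignments, gene models and arbitrary genomic intervals are all represented as (collections of) ranges.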

We will, in addition, use two packages for Differential Expression data analysis: tximport and DESeq2.
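To give a sense of where the tutorial is headed, a Differential Expression analysis with DESeq2 boils down to very few calls. The following is only a sketch: the names counts (a genes-by-samples integer matrix) and samples (a data.frame with a sex column) are placeholder assumptions, not objects provided by the package.

```r
## sketch only: 'counts' (genes x samples integer matrix) and 'samples'
## (data.frame with a 'sex' column) are assumed to exist already
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData=counts, colData=samples, design=~sex)
dds <- DESeq(dds)      # normalisation, dispersion estimation and testing
res <- results(dds)    # per-gene log2 fold-changes and adjusted p-values
```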

2.3.1 A word on High-throughput sequence analysis

Recent technological developments introduced high-throughput sequencing approaches. A variety of experimental protocols and analysis workflows address gene expression, regulation, and encoding of genetic variants. Experimental protocols produce a large number (tens to hundreds of millions per sample) of short (e.g.: 35-250, single or paired-end) nucleotide sequences. These are aligned to a reference or other genome or transcriptome. Analysis workflows use the alignments to infer levels of gene expression (RNA-seq), binding of regulatory elements to genomic locations (ChIP-seq), or prevalence of structural variants (e.g.: SNPs, short indels, large-scale genomic rearrangements). Sample sizes range from minimal replication (e.g.: 2 samples per treatment group) to thousands of individuals.

2.3.2 A word on RNA-Seq

RNA-Seq (RNA-Sequencing) has become the preferred method for measuring gene expression, providing an accurate proxy for absolute quantitation of messenger RNA (mRNA) levels within a sample (Mortazavi and others 2008). RNA-Seq has reached rapid maturity in data handling, QC (Quality Control) and downstream statistical analysis methods, benefiting substantially from the extensive body of literature developed for the analysis of microarray technologies and their application to measuring gene expression. Although the analysis of RNA-Seq data remains more challenging than that of microarray data, the field has now advanced to the point where it is possible to define mature pipelines and guidelines for such analyses. However, with the exception of commercial options such as the CLCbio CLC Genomics Workbench, we are not aware of any fully integrated open-source pipelines for performing these pre-processing steps. Both the technology behind RNA-Seq and the associated analysis methods continue to evolve at a rapid pace, and not all the properties of the data are yet fully understood. Hence, the steps and available software tools that could be used in such a pipeline have changed rapidly in recent years, and it is only recently that it has become possible to propose a de-facto standard pipeline. Although proposing such a skeleton pipeline is now feasible, there remain a number of caveats to keep in mind in order to produce biologically and statistically sound results.

A pipeline that we developed for non-model plant organisms, but which is generally applicable, can be seen in the figure below; a detailed description of it has been published2.

The pre-processing pipeline used during the course

2.4 A word on the dataset that is used in this tutorial

As a running example, we use a dataset derived from a study performed in P. tremula (K. Robinson et al. 2014). The authors test the evolutionary theory suggesting that males and females may evolve sexually dimorphic phenotypic and biochemical traits concordant with each sex having different optimal strategies of resource investment to maximise reproductive success and fitness. Such sexual dimorphism should result in sex-biased gene expression patterns in non-floral organs for autosomal genes associated with the control and development of such phenotypic traits. Hence, the authors, among other approaches, have looked for gene expression differences between the sexes. This was achieved by an RNA-Seq differential expression analysis between samples grouped by sex. The samples were sequenced on an Illumina HiSeq 2000, using TruSeq adapters, through a 101-cycle paired-end protocol, which yielded between 11 million and 34 million reads per sample.

In the following sections, we look at a subset of the data, corresponding to reads obtained from individuals of the RNA-seq experiment and aligned to the P. trichocarpa reference genome. The reads were aligned to the P. trichocarpa genome because we are still establishing the P. tremula genome. P. trichocarpa is a closely related species (P. trichocarpa grows in north-western America, while P. tremula grows in northern and central Europe).

As a side note, reads were retrieved from the European Nucleotide Archive (ENA, accession ID ERP002471), and were aligned to the P. trichocarpa reference genome using the STAR (Dobin et al. 2013) aligner.

2.5 A word on Integrated Development Environment (IDE)

There are numerous tools to support developing programs and software in R. For this course, we have selected one of them: the RStudio environment, which provides a feature-rich, user-friendly, cross-platform environment for working with R.

3 Introduction to R / Bioconductor

This chapter3 introduces R and Bioconductor in detail. During the course, we will only skim through it, as you are supposed to be familiar with most of the concepts presented here. If that were not the case - and even if it were - consider it as homework.

3.1 Bioconductor

Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. Bioconductor started more than 10 years ago and gained credibility for its statistically rigorous approach to microarray pre-processing and analysis of designed experiments, and integrative and reproducible approaches to bioinformatic tasks. There are now more than 800 Bioconductor packages for expression and other microarrays, sequence analysis, flow cytometry, imaging, and other domains. The Bioconductor web site provides installation, package repository, help, and other documentation.

The Bioconductor web site is at bioconductor.org.

3.2 Statistical programming

Many academic and commercial software products are available; why would one use R and Bioconductor? One answer is to ask about the demands high-throughput genomic data places on effective computational biology software.

3.2.1 Effective computational biology software

  1. High-throughput questions make use of large data sets. This applies both to the primary data (microarray expression values, sequenced reads, etc.) and also to the annotations on those data (coordinates of genes and features such as exons or regulatory regions; participation in biological pathways, etc.). Large data sets place demands on our tools that preclude some standard approaches, such as spreadsheets. Likewise, intricate relationships between data and annotation, and the diversity of research questions, require flexibility typical of a programming language rather than a narrowly-enabled graphical user interface.

  2. Analysis of high-throughput data is necessarily statistical. The volume of data requires that it be appropriately summarized before any sort of comprehension is possible. The data are produced by advanced technologies, and these introduce artifacts (e.g.: probe-specific bias in microarrays; sequence or base calling bias in RNA-seq experiments) that need to be accommodated to avoid incorrect or inefficient inference. Data sets typically derive from designed experiments, requiring a statistical approach both to account for the design and to correctly address the large number of observed values (e.g.: gene expression or sequence tag counts) and small number of samples accessible in typical experiments.

  3. Research needs to be reproducible. Reproducibility is both an ideal of the scientific method, and a pragmatic requirement. The latter comes from the long-term and multi-participant nature of contemporary science. An analysis will be performed for the initial experiment, revisited again during manuscript preparation, and revisited during reviews or in determining next steps. Likewise, analyses typically involve a team of individuals with diverse domains of expertise. Effective collaborations result when it is easy to reproduce, perhaps with minor modifications, an existing result, and when sophisticated statistical or bioinformatic analysis can be effectively conveyed to other group members.

  4. Science moves very quickly. This is driven by the novel questions that are the hallmark of discovery, and by technological innovation and accessibility. Rapidity of scientific development places significant burdens on software, which must also move quickly. Effective software cannot be too polished, because that requires that the correct analyses are ‘known’ and that significant resources of time and money have been invested in developing the software; this implies software that is tracking the trailing edge of innovation. On the other hand, leading-edge software cannot be too idiosyncratic; it must be usable by a wider audience than the creator of the software, and fit in with other software relevant to the analysis.

  5. Effective software must be accessible. Affordability is one aspect of accessibility. Another is transparent implementation, where the novel software is sufficiently documented and source code accessible enough for the assumptions, approaches, practical implementation decisions, and inevitable coding errors to be assessed by other skilled practitioners. A final aspect of affordability is that the software is actually usable. This is achieved through adequate documentation, support forums, and training opportunities.

3.2.2 Bioconductor as effective computational biology software

What features of R and Bioconductor contribute to its effectiveness as a software tool?

  1. Bioconductor is well suited to handle extensive data and annotation. Bioconductor ‘classes’ represent high-throughput data and their annotation in an integrated way. Bioconductor methods use advanced programming techniques or R resources (such as transparent data base or network access) to minimize memory requirements and integrate with diverse resources. Classes and methods coordinate complicated data sets with extensive annotation. Nonetheless, the basic model for object manipulation in R involves vectorized in-memory representations. For this reason, particular programming paradigms (e.g.: block processing of data streams; explicit parallelism) or hardware resources (e.g.: large-memory computers) are sometimes required when dealing with extensive data.

  2. R is ideally suited to addressing the statistical challenges of high-throughput data. Three examples include the development of the ‘RMA’ (affy) and other normalization algorithms for microarray pre-processing, the use of moderated \(t\)-statistics for assessing microarray differential expression (limma), and the development of negative binomial approaches to estimating the dispersion of read counts necessary for appropriate analysis of designed RNA-Seq experiments (Anders and Huber 2010; M. D. Robinson, McCarthy, and Smyth 2010).

  3. Many of the ‘old school’ aspects of R and Bioconductor facilitate reproducible research. An analysis is often represented as a text-based script. Reproducing the analysis involves re-running the script; adjusting how the analysis is performed involves simple text-editing tasks. Beyond this, R has the notion of a ‘vignette’, which represents an analysis as a LaTeX document with embedded R commands. The R commands are evaluated when the document is built, thus reproducing the analysis. The use of LaTeX, and more recently of simpler markdown languages in conjunction with pandoc to generate HTML formatted reports or vignettes means that the symbolic manipulations in the script are augmented with textual explanations and justifications for the approach taken; these include graphical and tabular summaries at appropriate places in the analysis. R includes facilities for reporting the exact version of R and associated packages used in an analysis so that, if needed, discrepancies between software versions can be tracked down and their importance evaluated. While users often think of R packages as providing new functionality, packages are also used to enhance reproducibility by encapsulating a single analysis. The package can contain data sets, vignette(s) describing the analysis, R functions that might have been written, scripts for key data processing stages, and documentation (via standard R help mechanisms) of what the functions, data, and packages are about.

  4. The Bioconductor project adopts practices that facilitate reproducibility. Versions of R and Bioconductor are released once and twice each year, respectively. Each Bioconductor release is the result of development, in a separate branch, during the previous six months. The release is built daily against the corresponding version of R on Linux, Mac, and Windows platforms, with an extensive suite of tests performed. The biocLite function ensures that each release of R uses the corresponding Bioconductor packages. The user thus has access to stable and tested package versions. R and Bioconductor are effective tools for reproducible research.

  5. R and Bioconductor exist on the leading portion of the software life cycle. Contributors are primarily from academic institutions, and are directly involved in novel research activities. New developments are made available in a familiar format, i.e.: the R language, packaging, and build systems. The rich set of facilities in R - e.g.: for advanced statistical analysis or visualization - and the extensive resources in Bioconductor - e.g.: for annotation using third-party data such as Biomart (www.biomart.org; Kasprzyk 2011) or UCSC genome browser tracks (http://genome.ucsc.edu/; Meyer et al. 2013) - mean that innovations can be directly incorporated into existing workflows. The ‘development’ branches of R and Bioconductor provide an environment where contributors can explore new approaches without alienating their user base.

  6. R and Bioconductor also fare well in terms of accessibility. The software is freely available. The source code is easily and fully accessible for critical evaluation. The R packaging and check system requires that all functions are documented. Bioconductor requires that each package contain vignettes to illustrate the use of the software. There are very active R and Bioconductor mailing lists for immediate support, and regular training and conference activities for professional development.

3.3 Bioconductor for high-throughput sequence analysis

The table below enumerates many of the packages available for sequence analysis. The table includes packages for representing sequence-related data (e.g.: GenomicRanges, Biostrings), as well as domain-specific analysis such as RNA-seq (e.g.: edgeR, DEXSeq), ChIP-seq (e.g.: ChIPpeakAnno, DiffBind), and SNPs and copy number variation (e.g.: genoset, ggtools, VariantAnnotation).

Concept              Packages
Data representation  IRanges, GenomicRanges, GenomicFeatures, Biostrings, BSgenome, girafe, genomeIntervals
Input / output       ShortRead (fastq), Rsamtools (bam), rtracklayer (gff, wig, bed), VariantAnnotation (vcf), R453Plus1Toolbox (454)
Annotation           biomaRt (Durinck et al. 2005), GenomicFeatures, ChIPpeakAnno, VariantAnnotation
Alignment            Biostrings, gmapR, Rsubread, Rbowtie
Visualization        ggbio, Gviz
Quality assessment   qrqc, seqbias, ReQON, htSeqTools, TEQC, Rolexa, ShortRead
RNA-seq              BitSeq, cqn, cummeRbund, DESeq (Anders and Huber 2010), DESeq2, DEXSeq, EDASeq, edgeR (M. D. Robinson, McCarthy, and Smyth 2010), gage, goseq, iASeq, tweeDEseq
ChIP-seq             BayesPeak, baySeq, ChIPpeakAnno, chipseq, ChIPseqR, ChIPsim, CSAR, DiffBind, MEDIPS, mosaics, NarrowPeaks, nucleR, PICS, PING, REDseq, Repitools, TSSi
Motifs               BCRANK, cosmo, cosmoGUI, MotIV, seqLogo, rGADEM
3C                   HiTC, r3Cseq
Copy number          cn.mops, CNAnorm, exomeCopy, segmentSeq
Microbiome           phyloseq, DirichletMultinomial (Holmes 2012), clstutils, manta, mcaGUI
Workflows            ArrayExpressHTS, Genominator, easyRNASeq, oneChannelGUI, QuasR, rnaSeqMap (Leśniewska and Okoniewski 2011)
Database             SRAdb

3.4 R

R is an open-source statistical programming language. It is used to manipulate data, to perform statistical analysis, and to present graphical and other results. R consists of a core language, additional ‘packages’ distributed with the R language, and a very large number of packages contributed by the broader community. Packages add specific functionality to an R installation. R has become the primary language of academic statistical analysis, and is widely used in diverse areas of research, government, and industry.

R has several unique features. It has a surprisingly ‘old school’ interface: users type commands into a console; scripts in plain text represent workflows; tools other than R are used for editing and other tasks. R is a flexible programming language, so while one person might use functions provided by R to accomplish advanced analytic tasks, another might implement their own functions for novel data types. As a programming language, R adopts syntax and grammar that differ from many other languages: objects in R are ‘vectors’, and functions are ‘vectorized’ to operate on all elements of the object; R objects have ‘copy on change’ and ‘pass by value’ semantics, reducing unexpected consequences for users at the expense of less efficient memory use; common paradigms in other languages, such as the ‘for’ loop, are encountered much less commonly in R. Many authors contribute to R, so there can be a frustrating inconsistency of documentation and interface. R grew up in the academic community, so authors have not shied away from trying new approaches. Common statistical analysis functions are very well-developed.
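The ‘pass by value’ semantics mentioned above can be seen in a short example: modifying an argument inside a function leaves the caller’s variable untouched.

```r
f <- function(v) {
    v[1] <- 0    # modifies the function's local copy only
    v
}
x <- c(1, 2, 3)
f(x)    # returns 0 2 3
x       # still 1 2 3: the caller's 'x' is unchanged
```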

3.4.1 R data types

Opening an R session results in a prompt. The user types instructions at the prompt. Here is an example:

    ## assign values 5, 4, 3, 2, 1 to variable 'x'
    x <- c(5, 4, 3, 2, 1)
    x
## [1] 5 4 3 2 1

The first line starts with a # to represent a comment; the line is ignored by R. The next line creates a variable x. The variable is assigned (using <-, we could have used = almost interchangeably) a value. The value assigned is the result of a call to the c function. That it is a function call is indicated by the function name followed by parentheses, c(). The c function takes zero or more arguments, and returns a vector. The vector is the value assigned to x. R responds to this line with a new prompt, ready for the next input. The next line asks R to display the value of the variable x. R responds by printing [1] to indicate that the subsequent number is the first element of the vector. It then prints the value of x.

R has many features to aid common operations. Entering sequences is a very common operation, and expressions of the form 2:4 create a sequence from 2 to 4. Sub-setting one vector by another is enabled with [. Here we create an integer sequence from 2 to 4, and use the sequence as an index to select the second, third, and fourth elements of x

    x[2:4]
## [1] 4 3 2

Index values can be repeated, and indices outside the domain of x return the special value NA. Negative index values remove elements from the vector. Logical and character vectors (described below) can also be used for sub-setting.
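These rules can be illustrated briefly:

```r
x <- c(5, 4, 3, 2, 1)
x[c(1, 1, 6)]    # a repeated index, and an index outside the domain of 'x'
## [1]  5  5 NA
x[-(1:2)]        # negative indices remove elements
## [1] 3 2 1
x[x > 2]         # logical sub-setting keeps elements where the test is TRUE
## [1] 5 4 3
```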

R functions operate on variables. Functions are usually vectorized, acting on all elements of their argument and obviating the need for explicit iteration. Functions can generate warnings when performing suspect operations, or errors if evaluation cannot proceed; try log(-1).

    log(x)
## [1] 1.61 1.39 1.10 0.69 0.00

3.4.1.1 Essential data types

R has a number of standard data types, to represent integer, numeric (floating point), complex, character, logical (Boolean), and raw (byte) data. It is possible to convert between data types, and to discover the type or mode of a variable.

    c(1.1, 1.2, 1.3)         # numeric
## [1] 1.1 1.2 1.3
    c(FALSE, TRUE, FALSE)    # logical
## [1] FALSE  TRUE FALSE
    c("foo", "bar", "baz")   # character, single or double quote ok
## [1] "foo" "bar" "baz"
    as.character(x)          # convert 'x' to character
## [1] "5" "4" "3" "2" "1"
    typeof(x)                # the number 5 is numeric, not integer
## [1] "double"
    typeof(2L)               # append 'L' to force integer
## [1] "integer"
    typeof(2:4)              # ':' produces a sequence of integers
## [1] "integer"

R includes data types particularly useful for statistical analysis, including factor to represent categories and NA (used in any vector) to represent missing values.

    sex <- factor(c("Male", "Female", NA), levels=c("Female", "Male"))
    sex
## [1] Male   Female <NA>  
## Levels: Female Male

3.4.1.2 Lists, data frames, and matrices

All of the vectors mentioned so far are homogeneous, consisting of a single type of element. A list can contain a collection of different types of elements and, like all vectors, these elements can be named to create a key-value association.

    lst <- list(a=1:3, b=c("foo", "bar"), c=sex)
    lst
## $a
## [1] 1 2 3
## 
## $b
## [1] "foo" "bar"
## 
## $c
## [1] Male   Female <NA>  
## Levels: Female Male

Lists can be subset like other vectors to get another list, or subset with [[ to retrieve the actual list element; as with other vectors, sub-setting can use names

    lst[c(3, 1)]             # another list
## $c
## [1] Male   Female <NA>  
## Levels: Female Male
## 
## $a
## [1] 1 2 3
    lst[["a"]]               # the element itself, selected by name
## [1] 1 2 3

A data.frame is a list of equal-length vectors, representing a rectangular data structure not unlike a spread sheet. Each column of the data frame is a vector, so data types must be homogeneous within a column. A data.frame can be subset by row or column using [,], and columns can be accessed with $ or [[. In the [,] notation, the row subset appears before the comma and the column subset after it, i.e.: df[rows, columns]. Either subset can be left empty.

    df <- data.frame(age=c(27L, 32L, 19L),
                     sex=factor(c("Male", "Female", "Male")))
    df
##   age    sex
## 1  27   Male
## 2  32 Female
## 3  19   Male
    df[c(1, 3),]
##   age  sex
## 1  27 Male
## 3  19 Male
    df[df$age > 20,]
##   age    sex
## 1  27   Male
## 2  32 Female

A matrix is also a rectangular data structure, but subject to the constraint that all elements are the same type. A matrix is created by taking a vector, and specifying the number of rows or columns the vector is to represent. On sub-setting, R coerces a single column data.frame or single row or column matrix to a vector if possible; use drop=FALSE to stop this behavior.

    m <- matrix(1:12, nrow=3)
    m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
    m[c(1, 3), c(2, 4)]
##      [,1] [,2]
## [1,]    4   10
## [2,]    6   12
    m[, 3]
## [1] 7 8 9
    m[, 3, drop=FALSE]
##      [,1]
## [1,]    7
## [2,]    8
## [3,]    9

An array is a data structure for representing homogeneous, rectangular data in higher dimensions.
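For example, by analogy with the matrix above:

```r
a <- array(1:24, dim=c(2, 3, 4))    # 2 rows, 3 columns, 4 'slices'
dim(a)
## [1] 2 3 4
a[1, 2, 3]    # elements are filled with earlier dimensions varying fastest
## [1] 15
```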

3.4.1.3 S3 and S4 classes

More complicated data structures are represented using the ‘S3’ or ‘S4’ object system. Objects are often created by functions (for example, lm, below), with parts of the object extracted or assigned using accessor functions. The following generates 1000 random normal deviates as x, and uses these to create another 1000 deviates y that are linearly related to x but with some error. We fit a linear regression using a ‘formula’ to describe the relationship between variables, summarize the results in a familiar ANOVA table, and access fit (an S3 object) for the residuals of the regression, using these as input first to the var (variance) and then sqrt (square-root) functions. Objects can be interrogated for their class.

    x <- rnorm(1000, sd=1)
    y <- x + rnorm(1000, sd=.5)
    fit <- lm(y ~ x)       # formula describes linear regression 
    fit                    # an 'S3' object
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      0.0184       0.9860
    anova(fit)
## Analysis of Variance Table
## 
## Response: y
##            Df Sum Sq Mean Sq F value Pr(>F)    
## x           1    988     988    3738 <2e-16 ***
## Residuals 998    264       0                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    sqrt(var(resid(fit)))  # residuals accessor and subsequent transforms
## [1] 0.51
    class(fit)
## [1] "lm"

Many Bioconductor packages implement S4 objects to represent data. S3 and S4 systems are quite different from a programmer’s perspective, but fairly similar from a user’s perspective: both systems encapsulate complicated data structures, and allow for methods specialized to different data types; accessors are used to extract information from the objects.
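As an illustration only, here is a minimal, entirely hypothetical S4 class; none of these names come from an actual Bioconductor package:

```r
library(methods)    # S4 machinery; attached by default in most sessions

## a hypothetical 'Sample' class, for illustration only
setClass("Sample", slots=c(name="character", counts="numeric"))
s <- new("Sample", name="A1", counts=c(10, 20, 30))

## a generic, and a method specialised for 'Sample' that sums the counts slot
setGeneric("totalCounts", function(object) standardGeneric("totalCounts"))
setMethod("totalCounts", "Sample", function(object) sum(object@counts))
totalCounts(s)
## [1] 60
```

In real packages, slots are usually reached through accessor functions rather than with @, so that the internal representation can change without breaking user code.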

3.4.1.4 Functions

R functions accept arguments, and return values. Arguments can be required or optional. Some functions may take variable numbers of arguments, e.g.: the columns in a data.frame

    y <- 5:1
    log(y)
## [1] 1.61 1.39 1.10 0.69 0.00
    args(log)        # arguments 'x' and 'base'; see ?log
## function (x, base = exp(1)) 
## NULL
    log(y, base=2)   # 'base' is optional, with default value
## [1] 2.3 2.0 1.6 1.0 0.0
    try(log())       # 'x' required; 'try' continues even on error
    args(data.frame) # ... represents variable number of arguments
## function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, 
##     fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors()) 
## NULL

Arguments can be matched by name or position. If an argument appears after ..., it must be named.

    log(base=2, y)   # match argument 'base' by name, 'x' by position
## [1] 2.3 2.0 1.6 1.0 0.0

A function such as anova is a generic that provides an overall signature but dispatches the actual work to the method corresponding to the class(es) of the arguments used to invoke the generic. A generic may have fewer arguments than a method, as with the S3 function anova and its method anova.glm.

    args(anova)
## function (object, ...) 
## NULL
    args(stats:::anova.glm)
## function (object, ..., dispersion = NULL, test = NULL) 
## NULL

The ... argument in the anova generic means that additional arguments are possible; the anova generic hands these arguments to the method it dispatches to. Note that in the previous example, the notation “library:::function” is necessary as the anova.glm function is only present within the stats library namespace (i.e.: the package self-environment) and is not exported to the user environment. See subsection for more information about libraries.

3.4.2 Useful functions

R has a very large number of functions. The following is a brief list of those that might be commonly used and particularly useful.

dir, read.table (and friends), scan

List files in a directory, read spreadsheet-like data into R, efficiently read homogeneous data (e.g.: a file of numeric values) to be represented as a matrix.

c, factor, data.frame, matrix

Create a vector, factor, data frame or matrix.

summary, table, xtabs

Summarize, create a table of the number of times elements occur in a vector, cross-tabulate two or more variables.
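For example:

```r
sex <- factor(c("Male", "Female", "Male", "Male"))
table(sex)    # counts per level: Female 1, Male 3
df <- data.frame(sex=sex, treated=c(TRUE, FALSE, TRUE, FALSE))
xtabs(~ sex + treated, df)    # cross-tabulate the two variables
```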

t.test, aov, lm, anova, chisq.test

Basic comparison of two groups (t.test), of several groups via analysis of variance / linear models (aov output is probably more familiar to biologists), or of simpler with more complicated models (anova); chi-squared tests (chisq.test).

dist, hclust

Cluster data.

plot

Plot data.

ls, str, library, search

List objects in the current (or specified) workspace, or peek at the structure of an object; attach a package to the search path, or describe the search path of attached packages.

lapply, sapply, mapply, aggregate

Apply a function to each element of a list (lapply, sapply), to elements of several lists (mapply), or to elements of a list partitioned by one or more factors (aggregate).
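For example:

```r
lst <- list(a=1:3, b=4:6)
lapply(lst, sum)    # a list: $a is 6, $b is 15
sapply(lst, sum)    # simplified to a named vector: a 6, b 15
df <- data.frame(age=c(27, 32, 19), sex=c("Male", "Female", "Male"))
aggregate(age ~ sex, df, mean)    # mean age per sex
```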

with

Conveniently access columns of a data frame or other element without having to repeat the name of the data frame.

match, %in%

Report the index or existence of elements from one vector that match another.
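For example:

```r
x <- c("a", "b", "c")
y <- c("b", "c", "d")
match(x, y)    # NA 1 2: position of each element of 'x' in 'y'
x %in% y       # FALSE TRUE TRUE
```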

split, cut

Split one vector by an equal length factor, cut a single vector into intervals encoded as levels of a factor.
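For example:

```r
split(1:6, rep(c("odd", "even"), 3))    # $even: 2 4 6, $odd: 1 3 5
cut(c(1, 5, 9), breaks=c(0, 4, 10))     # (0,4] (4,10] (4,10]
```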

strsplit, grep, sub

Operate on character vectors: split them into distinct fields, search for occurrences of a pattern using regular expressions (see ?regex), or substitute a string for a regular expression match.
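For example:

```r
strsplit(c("a,b", "c,d,e"), ",")      # a list: c("a","b") and c("c","d","e")
grep("^c", c("cat", "dog", "cow"))    # 1 3: indices of matching elements
sub("o", "0", c("foo", "boo"))        # "f0o" "b0o": first occurrence only
```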

install.packages

Install a package from an on-line repository into your R installation.

traceback, debug, browser

Report the sequence of functions under evaluation at the time of the error; enter a debugger when a particular function or statement is invoked.

?, example

See the help pages (e.g.: ?lm) and examples (example(match)) for each of these functions.

3.4.2.1 Exemplary use case

This example uses data describing 128 microarray samples as a basis for exploring R functions. Covariates such as age, sex, type, stage of the disease, etc., are in the data file pData.csv.

The following command creates a variable that is the location of a comma-separated value (‘csv’) file to be used in the exercise. A csv file can be created using, e.g.: ‘Save as…’ in spreadsheet software.

      pdataFile <- system.file(package="RnaSeqTutorial", "extdata", "pData.csv")

Input the csv file using read.table, assigning the input to a variable pdata. Use dim to find out the dimensions (number of rows, number of columns) in the object. Are there 128 rows? Use names or colnames to list the columns’ names. Use summary to summarize each column of the data. What are the data types of each column in the data frame?

3.4.2.2 Exemplary use case (continued)

    pdata <- read.table(pdataFile)  
    dim(pdata)
## [1] 128  21
    names(pdata)
##  [1] "cod"            "diagnosis"      "sex"            "age"           
##  [5] "BT"             "remission"      "CR"             "date.cr"       
##  [9] "t.4.11."        "t.9.22."        "cyto.normal"    "citog"         
## [13] "mol.biol"       "fusion.protein" "mdr"            "kinet"         
## [17] "ccr"            "relapse"        "transplant"     "f.u"           
## [21] "date.last.seen"
    summary(pdata)
##       cod           diagnosis     sex          age           BT    
##  10005  :  1   1/15/1997 :  2   F   :42   Min.   : 5   B2     :36  
##  1003   :  1   1/29/1997 :  2   M   :83   1st Qu.:19   B3     :23  
##  1005   :  1   11/15/1997:  2   NA's: 3   Median :29   B1     :19  
##  1007   :  1   2/10/1998 :  2             Mean   :32   T2     :15  
##  1010   :  1   2/10/2000 :  2             3rd Qu.:46   B4     :12  
##  11002  :  1   (Other)   :116             Max.   :58   T3     :10  
##  (Other):122   NA's      :  2             NA's   :5    (Other):13  
##  remission                  CR           date.cr    t.4.11.       
##  CR  :99   CR                :96   11/11/1997: 3   Mode :logical  
##  REF :15   DEATH IN CR       : 3   1/21/1998 : 2   FALSE:86       
##  NA's:14   DEATH IN INDUCTION: 7   10/18/1999: 2   TRUE :7        
##            REF               :15   12/7/1998 : 2   NA's :35       
##            NA's              : 7   1/17/1997 : 1                  
##                                    (Other)   :87                  
##                                    NA's      :31                  
##   t.9.22.        cyto.normal               citog        mol.biol 
##  Mode :logical   Mode :logical   normal       :24   ALL1/AF4:10  
##  FALSE:67        FALSE:69        simple alt.  :15   BCR/ABL :37  
##  TRUE :26        TRUE :24        t(9;22)      :12   E2A/PBX1: 5  
##  NA's :35        NA's :35        t(9;22)+other:11   NEG     :74  
##                                  complex alt. :10   NUP-98  : 1  
##                                  (Other)      :21   p15/p16 : 1  
##                                  NA's         :35                
##    fusion.protein   mdr          kinet       ccr           relapse       
##  p190     :17     NEG :101   dyploid:94   Mode :logical   Mode :logical  
##  p190/p210: 8     POS : 24   hyperd.:27   FALSE:74        FALSE:35       
##  p210     : 8     NA's:  3   NA's   : 7   TRUE :26        TRUE :65       
##  NA's     :95                             NA's :28        NA's :28       
##                                                                          
##                                                                          
##                                                                          
##  transplant                     f.u        date.last.seen
##  Mode :logical   REL              :61   1/7/1998  : 2    
##  FALSE:91        CCR              :23   12/15/1997: 2    
##  TRUE :9         BMT / DEATH IN CR: 4   12/31/2002: 2    
##  NA's :28        BMT / CCR        : 3   3/29/2001 : 2    
##                  DEATH IN CR      : 2   7/11/1997 : 2    
##                  (Other)          : 7   (Other)   :83    
##                  NA's             :28   NA's      :35

Let’s move on to the next topic:

  • A data frame is a list of equal length vectors. Select the ‘sex’ column of the data frame using [[ or $. Pause to explain to your neighbor why this sub-setting works. Since a data frame is a list, use sapply to ask about the class of each column in the data frame. Explain to your neighbor what this produces, and why.

The solution is that a data frame can be subset as if it were a matrix, or a list of column vectors.

    head(pdata[,"sex"], 3)
## [1] M M F
## Levels: F M
    head(pdata$sex, 3)
## [1] M M F
## Levels: F M
    head(pdata[["sex"]], 3)
## [1] M M F
## Levels: F M
    sapply(pdata, class)
##            cod      diagnosis            sex            age             BT 
##       "factor"       "factor"       "factor"      "integer"       "factor" 
##      remission             CR        date.cr        t.4.11.        t.9.22. 
##       "factor"       "factor"       "factor"      "logical"      "logical" 
##    cyto.normal          citog       mol.biol fusion.protein            mdr 
##      "logical"       "factor"       "factor"       "factor"       "factor" 
##          kinet            ccr        relapse     transplant            f.u 
##       "factor"      "logical"      "logical"      "logical"       "factor" 
## date.last.seen 
##       "factor"
  • Use table to summarize the number of males and females in the sample. Consult the help page ?table to figure out additional arguments required to include NA values in the tabulation.

The number of males and females, including NA, is

    table(pdata$sex, useNA="ifany")
## 
##    F    M <NA> 
##   42   83    3

An alternative version of this uses the with function: with(pdata, table(sex, useNA="ifany")).

  • The mol.biol column summarizes molecular biological attributes of each sample. Use table to summarize the different molecular biology levels in the sample.

The mol.biol column contains the following samples:

    with(pdata, table(mol.biol, useNA="ifany"))
## mol.biol
## ALL1/AF4  BCR/ABL E2A/PBX1      NEG   NUP-98  p15/p16 
##       10       37        5       74        1        1
  • Use %in% to create a logical vector of the samples that are either BCR/ABL or NEG.

A logical vector indicating that the corresponding row is either BCR/ABL or NEG is constructed as

    ridx <- pdata$mol.biol %in% c("BCR/ABL", "NEG")
  • Subset the original phenotypic data to contain those samples that are BCR/ABL or NEG.

We can get a sense of the number of rows selected via table or sum (discuss with your neighbor what sum does, and why the answer is the same as the number of TRUE values in the result of the table function).

    table(ridx)
## ridx
## FALSE  TRUE 
##    17   111
    sum(ridx)
## [1] 111
  • After sub-setting, what are the levels of the mol.biol column? Set the levels to be BCR/ABL and NEG, i.e.: the levels in the subset.

The original data frame can be subset to contain only BCR/ABL or NEG samples using the logical vector that we created.

    pdata1 <- pdata[ridx,]

The levels of each factor reflect the levels in the original object, rather than the levels in the subset object, e.g.:

    levels(pdata$mol.biol)
## [1] "ALL1/AF4" "BCR/ABL"  "E2A/PBX1" "NEG"      "NUP-98"   "p15/p16"

These can be re-coded by updating the new data frame to contain a factor with the desired levels.

    pdata1$mol.biol <- factor(pdata1$mol.biol)
    table(pdata1$mol.biol)
## 
## BCR/ABL     NEG 
##      37      74
  • One would like covariates to be similar across groups of interest. Use t.test to assess whether BCR/ABL and NEG have individuals with similar age. To do this, use a formula that describes the response age in terms of the predictor mol.biol. If age is not independent of molecular biology, what complications might this introduce into subsequent analysis? Use a boxplot to represent the age depending on the molecular biology.

To ask whether age differs between molecular biology types, we use a formula, age ~ mol.biol, to describe the relationship (‘age as a function of molecular biology’) that we wish to test:

    with(pdata1, t.test(age ~ mol.biol))
## 
##  Welch Two Sample t-test
## 
## data:  age by mol.biol
## t = 5, df = 70, p-value = 8e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.1 17.2
## sample estimates:
## mean in group BCR/ABL     mean in group NEG 
##                    40                    28

This summary can be visualized with, e.g.: the boxplot function

    boxplot(age ~ mol.biol, pdata1)

For a fancier plot, see the following that uses some additional arguments of the boxplot function.

     boxplot(age ~ mol.biol, pdata1, notch=TRUE, ylab="age (yr)",
             main="Age distribution by genotype", xlab="genotype")

3.5 Packages

Packages provide functionality beyond that available in base R. There are almost 5,500 packages in CRAN (the Comprehensive R Archive Network) and more than 800 Bioconductor software packages. Packages are contributed by diverse members of the community; they vary in quality (many are excellent) and sometimes contain idiosyncratic aspects to their implementation. The following table outlines key base packages and selected contributed packages; see a local CRAN mirror (including the task views summarizing packages in different domains) and Bioconductor for additional contributed packages.

Package Description
base Data input and essential manipulation; scripting and programming concepts.
stats Essential statistical and plotting functions.
lattice, ggplot2 Approaches to advanced graphics.
methods ‘S4’ classes and methods.
parallel Facilities for parallel evaluation.

The lattice package illustrates the value packages add to base R. lattice is distributed with R but not loaded by default. It provides a very expressive way to visualize data. The following example plots yield for a number of barley varieties, conditioned on site and grouped by year. The following figure is read from the lower left corner. Note the common scales, efficient use of space, and not-too-pleasing default color palette. The Morris sample appears to be mis-labeled for ‘year’, an apparent error in the original data. Find out about the built-in data set used in this example with ?barley.

    library(lattice)
    plt <- dotplot(variety ~ yield | site, data = barley, groups = year,
                   xlab = "Barley Yield (bushels/acre)" , ylab=NULL,
                   key = simpleKey(levels(barley$year), space = "top", 
                     columns=2),
                   aspect=0.5, layout = c(2,3))
    print(plt)

New packages can be added to an R installation using install.packages. A package is installed only once per R installation, but needs to be loaded (with library) in each session in which it is used. Loading a package also loads any package that it depends on. Packages loaded in the current session are displayed with search. The ordering of packages returned by search represents the order in which the global environment (where commands entered at the prompt are evaluated) and attached packages are searched for symbols; it is possible for a package earlier in the search path to mask symbols later in the search path; these can be disambiguated using ::.

    length(search())
## [1] 9
    search()
## [1] ".GlobalEnv"        "package:lattice"   "package:stats"    
## [4] "package:graphics"  "package:grDevices" "package:utils"    
## [7] "package:datasets"  "Autoloads"         "package:base"
    base::log(1:3)
## [1] 0.00 0.69 1.10

Use the library function to load the RnaSeqTutorial package. Use the sessionInfo function to verify that you are using R version 3.2.0 and current packages, similar to those reported here. What other packages were loaded along with RnaSeqTutorial?

    library(RnaSeqTutorial)
    sessionInfo()

3.5.1 Help

Find help using the R help system. Start a web browser with:

    help.start()

The ‘Search Engine and Keywords’ link is helpful in day-to-day use.

3.5.1.1 Manual pages

Use manual pages to find detailed descriptions of the arguments and return values of functions, and the structure and methods of classes. Find help within an R session as:

    ?data.frame
    ?lm
    ?anova             # a generic function
    ?anova.lm          # an S3 method, specialized for 'lm' objects

S3 methods can be queried interactively:

    methods(anova)
## [1] anova.glm*     anova.glmlist* anova.lm*      anova.lmlist* 
## [5] anova.loess*   anova.mlm*     anova.nls*    
## see '?methods' for accessing help and source code
    methods(class="glm")
##  [1] add1           anova          confint        cooks.distance
##  [5] deviance       drop1          effects        extractAIC    
##  [9] family         formula        influence      logLik        
## [13] model.frame    nobs           predict        print         
## [17] residuals      rstandard      rstudent       summary       
## [21] vcov           weights       
## see '?methods' for accessing help and source code

It is often useful to view a method definition, either by typing the method name at the command line or, for ‘non-visible’ methods, using getAnywhere:

    ls
    getAnywhere("anova.loess")

Here we discover that the function head (which returns the first 6 elements of anything) defined in the utils package, is an S3 generic (indicated by UseMethod) and has several methods. We use head to look at the first six lines of the head method specialized for matrix objects.

    utils::head
## function (x, ...) 
## UseMethod("head")
## <bytecode: 0x7f8f4bdfddf0>
## <environment: namespace:utils>
    methods(head)
## [1] head.data.frame* head.default*    head.ftable*     head.function*  
## [5] head.matrix      head.table*     
## see '?methods' for accessing help and source code
    head(head.matrix)
##                                 
## 1 function (x, n = 6L, ...)     
## 2 {                             
## 3     stopifnot(length(n) == 1L)
## 4     n <- if (n < 0L)          
## 5         max(nrow(x) + n, 0L)  
## 6     else min(n, nrow(x))

S4 classes and generics are queried in a similar way to S3 classes and generics, but with different syntax, as for the complement generic in the Biostrings package:

    library(Biostrings)
    showMethods(complement)

(Most) methods defined on the DNAStringSet class of Biostrings and available on the current search path can be found with

    showMethods(class="DNAStringSet", where=search())

Obtaining help on S4 classes and methods requires syntax such as

    class ? DNAStringSet
    method ? "complement,DNAStringSet"

The specification of method and class in the latter must not contain a space after the comma. The definition of a method can be retrieved as

    selectMethod(complement, "DNAStringSet")

3.5.1.2 Vignettes

Vignettes, especially in Bioconductor packages, provide an extensive narrative describing overall package functionality. Use

    vignette(package="RnaSeqTutorial")

to see, in your web browser, vignettes available in the RnaSeqTutorial package. Vignettes usually consist of text with embedded R code, a form of literate programming. The vignette can be read as a PDF document, while the R source code is present as a script file ending with extension .R. The script file can be sourced or copied into an R session to evaluate exactly the commands used in the vignette.

Use the following to open the vignette of interest:

    vignette("RnaSeqTutorial")

Scavenger hunt. Spend five minutes tracking down the following information.

  1. The package containing the library function.

  2. The author of the alphabetFrequency function, defined in the Biostrings package.

  3. A description of the GAlignments class.

  4. The number of vignettes in the GenomicRanges package.

Possible solutions are found with the following R commands

    ?library
    library(Biostrings)
    ?alphabetFrequency
    class?GAlignments
    vignette(package="GenomicRanges")

3.5.2 Efficient scripts

There are often many ways to accomplish a result in R, but these different ways often have very different speed or memory requirements. For small data sets these performance differences are not that important, but for large data sets (e.g.: high-throughput sequencing; genome-wide association studies, GWAS) or complicated calculations (e.g.: bootstrapping) performance can be important. There are several approaches to achieving efficient R programming.

3.5.2.1 Easy solutions

Several common performance bottlenecks often have easy solutions; these are outlined here.

Text files often contain more information than is needed, for example genotypes of 1000s of individuals at millions of SNPs, when only a subset of the data is required, e.g.: during algorithm development. Reading in all the data can be demanding in terms of both memory and time. A solution is to use arguments such as colClasses to specify the columns and their data types that are required, and to use the nrows argument to limit the number of rows input. For example, the following ignores the first and fourth columns, reading in only the second and third (as type integer and numeric).

    # not evaluated
    colClasses <-
     c("NULL", "integer", "numeric", "NULL")
    df <- read.table("myfile", colClasses=colClasses)

R is vectorized, so traditional programming for loops are often not necessary. Rather than calculating 100000 random numbers one at a time, or squaring each element of a vector, or iterating over rows and columns in a matrix to calculate row sums, invoke the single function that performs each of these operations.

    x <- runif(100000); x2 <- x^2
    m <- matrix(x2, nrow=1000); y <- rowSums(m)

This often requires a change of thinking, turning the sequence of operations ‘inside-out’. For instance, calculate the log of the square of each element of a vector by calculating the square of all elements, followed by the log of all elements (x2 <- x^2; logx2 <- log(x2)), or simply logx2 <- log(x^2).

It may sometimes be natural to formulate a problem as a for loop, or the formulation of the problem may require that a for loop be used. In these circumstances the appropriate strategy is to pre-allocate the object, and to fill the result in during loop iteration.

    ## not evaluated
    result <- numeric(nrow(df))
    for (i in seq_len(nrow(df)))
     result[[i]] <- some_calc(df[i,])

Some R operations are helpful in general, but misleading or inefficient in particular circumstances. An example is the behavior of unlist when the list is named – R creates new names that have been made unique. This can be confusing (e.g.: when Entrez gene identifiers are ‘mangled’ to unintentionally look like other identifiers) and expensive (when a large number of new names need to be created). Avoid creating unnecessary names, e.g.:

    unlist(list(a=1:2)) # name 'a' becomes 'a1', 'a2'
## a1 a2 
##  1  2
    unlist(list(a=1:2), use.names=FALSE)   # no names
## [1] 1 2

Names can be very useful for avoiding book-keeping errors, but are inefficient for repeated look-ups; use vectorized access or numeric indexing instead.
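As an illustration of the last point (using made-up identifiers), a single vectorized subset by name is preferable to one look-up per iteration:

    # named vector of 10,000 hypothetical measurements
    x <- setNames(runif(10000), paste0("id", seq_len(10000)))
    wanted <- paste0("id", sample(10000, 100))
    # slow: one name look-up per element
    res1 <- vapply(wanted, function(nm) x[[nm]], numeric(1))
    # fast: a single vectorized look-up
    res2 <- x[wanted]
    all.equal(unname(res1), unname(res2))   # TRUE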

3.5.2.2 Moderate solutions

Several solutions to inefficient code require greater knowledge to implement.

Using appropriate functions can greatly influence performance; it takes experience to know when an appropriate function exists. For instance, the lm function could be used to assess differential expression of each gene on a microarray, but the limma package implements this operation in a way that takes advantage of the experimental design that is common to each probe on the microarray, and does so in a very efficient manner.

    ## not evaluated
    library(limma) # microarray linear models
    fit <- lmFit(eSet, design)

Using appropriate algorithms can have significant performance benefits, especially as data becomes larger. This solution requires moderate skills, because one has to be able to think about the complexity (e.g.: expected number of operations) of an algorithm, and to identify algorithms that accomplish the same goal in fewer steps. For example, a naive way of identifying which of 100 numbers are in a set of size 10 might compare each of the 100 numbers against each element of the set (i.e.: quadratic time), but a faster way is to create a ‘hash’ table of one set of elements and probe it with each of the other elements (i.e.: linear time). The latter strategy is illustrated with

    x <- 1:100; s <- sample(x, 10)
    inS <- x %in% s

R is an interpreted language, and for very challenging computational problems it may be appropriate to write critical stages of an analysis in a compiled language like C or Fortran, or to use an existing programming library (e.g.: the BOOST graph library) that efficiently implements advanced algorithms. R has a well-developed interface to C or Fortran, so it is ‘easy’ to do this. This places a significant burden on the person implementing the solution, requiring knowledge of two or more computer languages and of the interface between them.

3.5.2.3 Measuring performance

When trying to improve performance, one wants to ensure (a) that the new code is actually faster than the previous code, and (b) both solutions arrive at the same, correct, answer.

The system.time function is a straight-forward way to measure the length of time a portion of code takes to evaluate. Here we see that the use of apply to calculate row sums of a matrix is much less efficient than the specialized rowSums function.

    m <- matrix(runif(200000), 20000)
    replicate(5, system.time(apply(m, 1, sum))[[1]])
## [1] 0.027 0.024 0.023 0.020 0.027
    replicate(5, system.time(rowSums(m))[[1]])
## [1] 0.001 0.001 0.001 0.000 0.001

Usually it is appropriate to replicate timings to average over vagaries of system use, and to shuffle the order in which timings of alternative algorithms are calculated to avoid artifacts such as initial memory allocation.

Speed is an important metric, but equivalent results are also needed. The functions identical and all.equal provide different levels of assessing equivalence, with all.equal providing ability to ignore some differences, e.g.: in the names of vector elements.

    res1 <- apply(m, 1, sum)
    res2 <- rowSums(m)
    identical(res1, res2)
## [1] TRUE
    identical(c(1, -1), c(x=1, y=-1))
## [1] FALSE
    all.equal(c(1, -1), c(x=1, y=-1),
              check.attributes=FALSE)
## [1] TRUE

Two additional functions for assessing performance are Rprof and tracemem; these are mentioned only briefly here. The Rprof function profiles R code, presenting a summary of the time spent in each part of several lines of R code. It is useful for gaining insight into the location of performance bottlenecks when these are not readily apparent from direct inspection. Memory management, especially copying large objects, can frequently contribute to poor performance. The tracemem function allows one to gain insight into how R manages memory; insights from this kind of analysis can sometimes be useful in restructuring code into a more efficient sequence.
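A minimal sketch of Rprof use (any non-trivial computation could stand in for the apply call):

    tf <- tempfile()
    Rprof(tf)                      # start profiling to a temporary file
    for (i in 1:20)
        apply(matrix(runif(1e5), 1000), 1, sum)
    Rprof(NULL)                    # stop profiling
    head(summaryRprof(tf)$by.self) # where was the time spent?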

3.5.3 Warnings, errors, and debugging

R signals unexpected results through warnings and errors. Warnings occur when the calculation produces an unusual result that nonetheless does not preclude further evaluation. For instance log(-1) results in a value NaN (‘not a number’) that allows computation to continue, but at the same time signals a warning

> log(-1)
[1] NaN
Warning message:
In log(-1) : NaNs produced

Errors result when the inputs or outputs of a function are such that no further action can be taken, e.g.: trying to take the square root of a character vector

> sqrt("two")
Error in sqrt("two") : Non-numeric argument to mathematical function

Warnings and errors occurring at the command prompt are usually easy to diagnose. They can be more enigmatic when occurring in a function, and exacerbated by sometimes cryptic (when read out of context) error messages.

An initial step in coming to terms with errors is to simplify the problem as much as possible, aiming for a ‘reproducible’ error. The reproducible error might involve a very small (even trivial) data set that immediately provokes the error. Often the process of creating a reproducible example helps to clarify what the error is, and what possible solutions might be.

Invoking traceback() immediately after an error occurs provides a ‘stack’ of the function calls that were in effect when the error occurred. This can help understand the context in which the error occurred. Knowing the context, one might use debug to enter into a browser (see ?browser) that allows one to step through the function in which the error occurred.

It can sometimes be useful to use global options (see ?options) to influence what happens when an error occurs. Two common global options are error and warn. Setting error=recover combines the functionality of traceback and debug, allowing the user to enter the browser at any level of the call stack in effect at the time the error occurred. Default error behavior can be restored with options(error=NULL). Setting warn=2 causes warnings to be promoted to errors. For instance, initial investigation of an error might show that the error occurs when one of the arguments to a function has value NaN. The error might be accompanied by a warning message that the NaN has been introduced, but because warnings are by default not reported immediately it is not clear where the NaN comes from. warn=2 means that the warning is treated as an error, and hence can be debugged using traceback, debug, and so on.

Additional useful debugging functions include browser, trace, and setBreakpoint.
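A short sketch of promoting warnings to errors and restoring the defaults:

    options(warn=2)            # warnings now stop evaluation as errors
    res <- try(log(-1))        # caught by try(); inspect with traceback()/debug()
    inherits(res, "try-error") # TRUE: the warning was treated as an error
    options(warn=0)            # restore default warning behavior
    # options(error=recover)   # interactive: browse the call stack on error
    # options(error=NULL)      # restore default error behavior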

3.5.4 Asking for help

As already mentioned, help can be sought from the mailing lists. For questions regarding R, address them to the
R mailing lists: http://www.r-project.org For Bioconductor questions, address them to the
Bioconductor mailing lists: http://bioconductor.org/help/mailing-list/

In either case, include a reproducible example and the output of sessionInfo(). Providing this information will speed up the help and, if there is an issue, its resolution.

3.5.5 Resources

3.5.5.1 Publications and Books

  • Dalgaard (Dalgaard 2008) provides an introduction to statistical analysis with R.

  • Kabacoff (Kabacoff 2010) provides a broad survey of R.

  • Matloff (Matloff 2011) introduces R programming concepts.

  • Chambers (Chambers 2008) provides more advanced insights into R.

  • Gentleman (Gentleman 2008) emphasizes use of R for bioinformatic programming tasks.

  • The R web site enumerates additional publications from the user community.

3.5.5.2 Integrated Development Environment (IDE)

The RStudio environment provides a nice, cross-platform environment for working in R.

3.5.5.3 Other resources

4 Raw Data Pre-processing

4.1 Getting your share of the data

If you have provided data, you have been instructed where to find it. Simply adapt the instructions in this chapter to pre-process your data instead. However, before you go ahead, downsize your data to an amount manageable in the limited time we have. To do so, adapt the following instructions:

# read the compressed forward fastq file,
# piping the result into sed to extract the
# first 20 million lines (equivalent to 5M sequences,
# as one sequence entry always spans 4 lines)
# and finally compress it into a new file
zcat pair_1.fastq.gz | sed -n 1,20000000p | gzip -c > pair-subset_1.fastq.gz
# same for read 2 (the reverse read) if your data is Paired-End
zcat pair_2.fastq.gz | sed -n 1,20000000p | gzip -c > pair-subset_2.fastq.gz
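A quick sanity check that the subsetting worked: since each FASTQ record spans four lines, the number of reads is the line count divided by four. The toy example below (hypothetical file names) builds a tiny three-read FASTQ and keeps the first two reads:

```shell
# build a toy FASTQ of 3 reads, keep the first 2 (8 lines), count the reads kept
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' \
  | gzip -c > toy.fastq.gz
zcat toy.fastq.gz | sed -n 1,8p | gzip -c > toy-subset.fastq.gz
echo $(( $(zcat toy-subset.fastq.gz | wc -l) / 4 ))   # prints 2
```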

If you had no data, or were not at liberty to share it, we have an exemplary dataset, described below.

The P. tremula sexual dimorphism dataset (K. Robinson et al. 2014) contains 17 samples in total. As we are working with PE (Paired-End) data, select at least a pair of files from the table below to process one sample. To identify which files you will be pre-processing, use the prefix of the files in the first column. In these file names, “_1” indicates that the sequences are from the first mate of the pair, whereas “_2” indicates the second mate sequences.

File Sample
202_subset_1.fq.gz 01
202_subset_2.fq.gz 01
207_subset_1.fq.gz 02
207_subset_2.fq.gz 02
213.1_subset_1.fq.gz 03
213.1_subset_2.fq.gz 03
221_subset_1.fq.gz 04
221_subset_2.fq.gz 04
226.1_subset_1.fq.gz 05
226.1_subset_2.fq.gz 05
229_subset_1.fq.gz 06
229_subset_2.fq.gz 06
229.1_subset_1.fq.gz 07
229.1_subset_2.fq.gz 07
235_subset_1.fq.gz 08
235_subset_2.fq.gz 08
236_subset_1.fq.gz 09
236_subset_2.fq.gz 09
239_subset_1.fq.gz 10
239_subset_2.fq.gz 10
244_subset_1.fq.gz 11
244_subset_2.fq.gz 11
303_subset_1.fq.gz 12
303_subset_2.fq.gz 12
305.3_subset_1.fq.gz 13
305.3_subset_2.fq.gz 13
309.1_subset_1.fq.gz 14
309.1_subset_2.fq.gz 14
310.3_subset_1.fq.gz 15
310.3_subset_2.fq.gz 15
337.1_subset_1.fq.gz 16
337.1_subset_2.fq.gz 16
349.2_subset_1.fq.gz 17
349.2_subset_2.fq.gz 17
Step Requires both files
Quality Control using FastQC NO
rRNA filtering using SortMeRNA YES
Quality Control using FastQC NO
Quality trimming and adapter removal using Trimmomatic YES
Quality Control using FastQC NO
Trimmed reads alignment using STAR YES
BAM format manipulation using samtools NO
Read count summarization using HTSeq htseq-count NO

4.2 Raw data FastQC

Note that in the following we will refer to any of the 17 samples as “read”, so if you selected sample 01, “read” means “202_subset” for you.

The first pre-processing step is to assess the quality of the raw data received from the sequencing facility. This data is commonly delivered in FASTQ format (Cock et al. 2009).

Upon receiving the RNA-Seq FASTQ files from the sequencing facility, it is essential to perform some initial QC assessments. Most importantly, one should check the overall sequence quality, the GC percentage distribution (the proportion of guanine and cytosine base pairs across the reads) and the presence / absence of overrepresented sequences. FastQC has become a de-facto standard for performing this task: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. A FastQC command line example is as follows:

# First, make an output directory called "qa" and a sub-directory of
# that called "raw"
mkdir -p qa/raw

# Then make copies of your sample files into your home folder. Always try to
# work on copies of the raw source data.
cp share/Day1/data/fastq/read_1.fq.gz .
cp share/Day1/data/fastq/read_2.fq.gz .

# Then you can run fastqc on the file(s) you have selected
fastqc -o qa/raw read_1.fq.gz

The output of FastQC is a zip archive and an HTML document, which is sub-divided into sections describing the specific metrics that were analyzed. These sections are detailed below. In order to view the HTML page that was created, go to the course webpage and click on “Connect to Apache2”, then click on home > qa > raw and finally on the HTML file.

4.2.1 Basic Statistics

Most metrics within this section are self-explanatory. For PE reads, the total number of sequences should match between the forward and reverse read files. It is good practice to take note of the FASTQ Phred encoding, as some downstream tools require the user to specify whether Phred64 or Phred33 encoding should be used. Finally, the %GC should lie within the expected values for the sample species.

4.2.2 Per base sequence quality

The Phred scale quality represents the probability p that the corresponding base call is incorrect. A Phred score Q is an integer mapping of p where: \[ Q = -10 \cdot \log_{10}(p) \] Briefly, a Phred score of 10 corresponds to one error in every 10 base calls, or 90% accuracy; a Phred score of 20 to one error in every 100 base calls, or 99% accuracy. The maximum Phred score is 40 (41 for the Illumina version 1.8+ encoding). See http://en.wikipedia.org/wiki/FASTQ_format#Quality for more details on the quality and http://en.wikipedia.org/wiki/FASTQ_format#Encoding for information on the corresponding encoding.
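The mapping is easy to verify in R; for example, error probabilities of 0.1, 0.01 and 0.001 correspond to Phred scores of 10, 20 and 30:

    p <- c(0.1, 0.01, 0.001)
    Q <- -10 * log10(p)                   # 10 20 30
    stopifnot(all.equal(10^(-Q / 10), p)) # invert the mapping to recover p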

The second FastQC section details the Phred scaled quality as a function of the position in the read. It is very common to observe a quality decrease as a function of the read length (Figure 2), and this pattern is often more pronounced for read 2 than for read 1; this is due to stochastic errors accumulating as the sequencing progresses, largely as a result of the enzyme “tiring out” and the increasing likelihood that a read cluster becomes out of sync, for example.

Example of Quality Control reports at different stages of the pipeline. A. The “Per sequence GC content” of the raw data. B. The same data shown in A but after rRNA filtering. C. “Per base sequence content” of the raw data. D. The same data after quality-based trimming has been performed.

4.2.3 Per sequence quality scores

This section details the quality distribution at the read level, in contrast to the quality per base position of the previous section. If the data is of good quality, the histogram will be skewed to the right.

4.2.4 Per base sequence content

In this section, the average proportion of individual bases (A, C, G and T) is plotted as a line across the length of the reads. The first 12 bases often show a bias that is characteristic of Illumina RNA-Seq data. This is in contrast with the DNA-Seq protocol, which does not show the same bias. The difference between protocols lies in three additional steps performed during the conversion of mRNA to cDNA, which is subsequently sequenced as if it were genomic DNA. Several hypotheses have been proposed as to the cause of this bias: the random hexamer primers used during reverse transcription of the captured RNA may introduce a positional bias of the reads; artifacts from end repair; and possibly a tenuous sequence specificity of the polymerase. Each may play a role, either singly or, most likely, in combination.

4.2.5 Per base GC content

Similar to the previous section, the GC content is shown as a function of the position in the read. As previously observed, a bias for the first base pairs (once more in contrast to DNA sequencing data) will often be observed. In addition, for non-strand-specific RNA-Seq data, the amounts of G and C and of A and T should be similar, on average, at any position within reads. Moreover, the proportion of G+C should match the expected GC content of the sample. For strand-specific data, if the RNA was selected using poly-dT beads, an enrichment for T over A may be observed.

4.2.6 Per sequence GC content

The plot in this section (see Figure 2 for an example) represents the distribution of GC content per read, where the data (red curve) is expected to approximately follow the theoretical distribution (blue curve). If the curve presents a shoulder in a region of high GC content, this is usually an indication that rRNA is present in the sample. However, it may also represent contamination by an organism with a higher GC content (such as bacteria or fungi). In contrast, a peak on the left hand side would indicate enrichment for A/T rich sequences. In particular a sharp peak for very low GC content (in the 0-3 range) is usually indicative of the sequencing of the mRNA poly-A tails. If this plot still shows issues after quality and rRNA filtering, additional steps would have to be taken to filter contaminants.

A comparison of the “theoretical” and “observed” GC distribution, the blue and red lines of the FastQC “Per sequence GC content”. A. Examples of “observed” GC distributions with a poly-A enrichment (green), rRNA enrichment (red) or no (black) bias. B. The corresponding “theoretical” curve that FastQC would devise from such read GC content distributions.
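The per-read GC content underlying this plot is easy to recompute by hand. A minimal sketch with a made-up two-read FASTQ (on real data, high-GC reads such as rRNA would populate the right-hand shoulder of the distribution):

```shell
# Per-read GC percentage with awk: gsub() counts (and removes) G/C characters,
# so the remaining $0 length is the A/T count (inline mock reads for illustration)
printf '@r1\nGCGCGCGCGC\n+\nIIIIIIIIII\n@r2\nATATATGCAT\n+\nIIIIIIIIII\n' | \
awk 'NR%4==2 {gc=gsub(/[GCgc]/,""); printf "%.0f%%\n", 100*gc/(gc+length($0))}'
```

Binning these percentages over all reads reproduces the red “observed” curve of the FastQC plot.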

4.2.7 Per base N content

This plot shows the fraction of indistinguishable bases as a function of the base position in the reads. In high quality sequence data this is expected to be close to zero. Deviations from the expected values indicate problems during the sequencing.

4.2.8 Sequence length distribution

This section shows the distribution of read lengths. Prior to trimming, there should be exactly one peak located at the sequenced read length.

4.2.9 Sequence duplication level

This plot represents the level of duplicate sequences in the library. FastQC assumes that the library is diverse, with an even representation of all sequences, i.e. a uniform coverage as would usually be obtained for DNA-Seq experiments. However, this assumption is not valid for RNA-Seq libraries, which have a large dynamic range, possibly containing a million-fold difference between lowly and highly expressed genes. As a result it is common to observe high duplication levels for sequences originating from highly expressed genes. It is worth noting that before version 0.11 of FastQC, all duplication levels >= 10 were aggregated into a single bin. In more recent versions this has been refined to provide a more accurate representation of the data.
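A rough duplication tally can also be made on the command line by counting identical read sequences. A hedged sketch on a made-up three-read FASTQ in which one sequence occurs twice (not how FastQC computes its plot, which additionally caps the number of reads it tracks):

```shell
# Count identical read sequences: extract sequence lines, sort, count duplicates
# (inline mock data; the trailing awk normalises uniq -c's leading whitespace)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTTT\n+\nIIII\n' | \
awk 'NR%4==2' | sort | uniq -c | sort -rn | awk '{print $1, $2}'
```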

4.2.10 Overrepresented sequences

This table shows sequences that are present at an unusually high frequency in the reads. These are most commonly sequencing adapters and will be identified as such. If unidentified sequences are listed, these may originate from rRNA or other contaminants, in which case contaminant filtering should be considered. Often a BLAST search of the unidentified sequence against the NCBI nt database will be informative.

4.2.11 Kmer content

The final plot of the FastQC report details the occurrence of kmers - nucleotide sequences of fixed length k - that were present at a higher than expected frequency as a function of their position within the read. Commonly, the early bases show the aforementioned Illumina sequencing bias (see the “Per base sequence content” section above), whereas the last bases may show enrichment for sequencing adapters.

Run FastQC for your assigned read file. Compare the results with those of your teammate. Does the QC look fine to you, or are there issues you would address?

Take a comparative look at samples 202, 207 and 213. Do these 3 samples have identical qualities? If not, what differentiates them? What does the sample you have analysed look like most among these 3?

There are 4 samples that have differing %GC characteristics as well as possibly some over-represented sequences: samples 202, 213.1, 303 and 310.3. All samples otherwise show some adapter contamination and the usual average quality drop as the sequencing cycles progress. Both these issues can be remedied, and this is what will be addressed in the following.

4.3 Filtering rRNA

Typically, wet-lab protocols to extract mRNA include a step to deplete the sample of rRNA or to enrich it for poly-adenylated transcripts (rRNA is not poly-adenylated). Common approaches to achieve this are to use RiboMinus™ kits (Life Technologies) or poly-dT beads, respectively, or to include a precipitation step that selectively precipitates only long (usually >200 bp) nucleotide fragments. No approach will be fully sensitive and, as a result, some rRNA carryover is to be expected. This is not a problem per se as long as the remaining proportion accounts for a low percentage of the reads (commonly between 0.1 and 3%). Larger proportions will have an effect on the usable number of reads obtained from the samples, as fewer sequence reads would originate from expressed mRNAs. This is not to be overlooked, as these rRNAs will produce valid alignments (in all reference genomes and for most de novo assembled transcriptomes and genomes) and hence other metrics (such as the alignment rate) will fail to identify such a bias. To control for the rRNA quantity in our sample FASTQ files, we use SortMeRNA, a tool originally developed to identify rRNA in metagenomics analyses (Kopylova, Noé, and Touzet 2012). The tool accepts FASTQ files (SE or PE) as input and includes a set of reference libraries containing the most common rRNAs (5, 5.8, 16, 18, 23 and 26-28S). Example command lines for a PE sample are:

# First uncompress your forward and reverse reads:
gunzip read_1.fq.gz read_2.fq.gz

# Merge forward and reverse reads into a single file (this is a requirement
# of the sortmerna software)
merge-paired-reads.sh read_1.fq read_2.fq read-interleaved.fq

# Run sortmerna
sortmerna --ref $SORTMERNA_DB --reads read-interleaved.fq --aligned \
read-rRNA-hits --other read-sortmerna --log -a 4 -v --paired_in --fastx

# Split the file again in two separate files, one for forward, one for reverse
# orientation
unmerge-paired-reads.sh read-sortmerna.fq read-sortmerna_1.fq read-sortmerna_2.fq

# Delete the interleaved files and compress the result files to save space
rm read-interleaved.fq read-sortmerna.fq
gzip read-sortmerna_1.fq read-sortmerna_2.fq

The format conversion step is not required for SE samples, nor is the “--paired_in” argument. The SORTMERNA_DB environment variable needs to be set at installation and the “-a” argument details the number of CPUs/threads to be used. The tool manual provides a comprehensive description of all functionalities. SortMeRNA does not currently support compressed input, hence the first and last steps to (de)compress the data.
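To see what the interleaving step actually produces, the merge can be emulated with standard tools: each 4-line record of read_1 is followed by the matching record of read_2. A hedged sketch on made-up one-read-pair files (merge-paired-reads.sh does this for you on real data; this is only to illustrate the format):

```shell
# Hypothetical inline read pair
printf '@r1/1\nACGT\n+\nIIII\n' > r1.fq
printf '@r1/2\nTTTT\n+\nIIII\n' > r2.fq
# Collapse each 4-line FASTQ record onto one tab-joined line...
awk 'ORS=NR%4?"\t":"\n"' r1.fq > r1.tmp
awk 'ORS=NR%4?"\t":"\n"' r2.fq > r2.tmp
# ...alternate the records from the two files, then restore the line breaks
paste -d'\n' r1.tmp r2.tmp | tr '\t' '\n'
```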

Run SortMeRNA on your sample (both read_1 and read_2). This may take a while (about 10 mins). Once done, look at the produced log file.

Next, we perform the QC on the filtered data.

4.4 Filtered data FastQC

The filtered data is again subjected to a QC assessment by FastQC to ensure the validity of the filtering steps.

Redo the FastQC on the SortMeRna files you have generated.

The GC content plot should show the biggest change, now fitting more closely to the theoretical distribution, as shown in Figure 2A and Figure 2B, which represent the raw and filtered GC content respectively. Shoulders, which were present in regions of higher GC content, should be noticeably smaller or be absent altogether. rRNA overrepresented sequences should no longer be identified in the corresponding table of over-represented sequences. Finally, the theoretical GC curve should be centered closer to the expected GC value of the sample organism.

4.5 Quality trimming and adapter removal

On Illumina sequencers, the quality of a base call is linked to its position in the read: bases in the later cycles of the sequencing process have a lower average quality than those in the earliest cycles (as was observed in the QC report above). This effect depends on the sequencing facility and on the chemistry used, and it is only recently that sequence aligners have integrated methods to correct for it - and not all alignment software does so. A common approach to increase the mapping rate of reads is to trim (remove) low quality bases from the 3’ end until the quality reaches a user-selected Phred-quality threshold. A threshold of 20 is widely accepted, as it corresponds to a base call error of 1 in 100, which is approximately the inherent technical error rate of the Illumina sequencing platform.
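The threshold idea can be illustrated with standard shell tools. A minimal sketch assuming Phred33 encoding and made-up read data: in Phred33, ‘5’ (ASCII 53) encodes Q20, so any trailing character below ‘5’ is a sub-Q20 base. Note that real trimmers such as Trimmomatic average the quality over a sliding window rather than cutting on single bases as done here:

```shell
# Naive 3' quality trimming at Q20 on a hypothetical read:
seq='ACGTACGTACG'
qual='IIIIIIIII##'   # two low-quality ('#' = Q2) bases at the 3' end
# strip trailing quality characters in the ASCII range '!'..'4' (Q0..Q19)...
kept=$(printf '%s' "$qual" | LC_ALL=C sed 's/[!-4]*$//')
# ...and cut the sequence to the same length
printf '%s\n' "$seq" | cut -c1-${#kept}
```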

An additional issue encountered with Illumina sequencing is the presence of partial adapter sequences within sequenced reads. This arises when the sample fragment size has a large variance and fragments shorter than the sequencer read-length are sequenced. As the resulting reads may contain a significant part of the adapter - a sequence of artificial origin - earlier generation alignment software (those that do not use Maximum Exact Matching and require global alignment of the entire read) may not be able to map such reads. Being able to identify the adapter-like sequences at the end of the read and clip/trim them - a process termed adapter removal - may consequently significantly increase the aligned read proportion.

There are numerous tools available to perform either or both of these tasks (quality trimming and adapter removal). Here, we exemplify using Trimmomatic (Bolger, Lohse, and Usadel 2014), a tool that does both. The command line for our PE sample example is as follows:

 trimmomatic PE -threads 4 -phred64 read-sortmerna_1.fq.gz \
 read-sortmerna_2.fq.gz read-sortmerna-trimmomatic_1.fq.gz \
 read-sortmerna-unpaired_1.fq.gz read-sortmerna-trimmomatic_2.fq.gz \
 read-sortmerna-unpaired_2.fq.gz \
 ILLUMINACLIP:"/usr/share/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa":2:30:10 \
 SLIDINGWINDOW:5:20 MINLEN:50

Run Trimmomatic on your filtered FASTQ data.

Once again, what would you do after having quality-trimmed your reads and removed the sequencing adapters?

Yes! Quality Control! As in the next section.

4.6 Trimmed data FastQC

A final FastQC run is performed to ensure that the previous quality trimming and/or adapter removal steps successfully conserved high quality reads without being too stringent and without introducing any newly apparent technical biases.

Redo the FastQC on the Trimmomatic files you have generated.

Several changes should be observed in comparison with the previous QC report: first, the per-base quality scores should be noticeably different. As shown in Figure 2C,D the per-sequence quality distribution is now shifted towards higher scores (the trimming effect) and sequencing adapters are no longer identified as over-represented (the adapter removal effect). If over-represented sequences remain, this indicates that an additional kind of contamination may be present and should be investigated.

5 Short Read Alignment

Once the raw read quality has been assessed and determined to be sufficient, or the data has been filtered and trimmed to acceptable standards, the reads can be aligned to a reference. This process is an extremely active field of research and novel aligners are frequently published. There is, sadly, no “silver bullet” and the choice of aligner will be dependent on the reference to be used (genome or transcriptome), the data type (short vs. longer reads) and the available computational power, among other factors. Most recent aligners use either BWT (Burrows M 1994) (Burrows-Wheeler transformation) or MEM (Khan et al. 2009) (Maximum Exact Matches) to perform alignment. Older generation alignment algorithms relied on a spaced-seed approach (Heng Li and Homer 2010). The numerous implementations of these different strategies all come with a myriad of options that may significantly affect the alignment outcome. Selecting the most accurate aligner and determining the optimal parameter set for a project can often represent a small project in itself. At the time of writing this guide there was no guideline available as to which aligner is most appropriate for a given situation (read length, type of reference, etc.). Hence, in the following, we exemplify using aligners that we have incorporated in our processing pipeline based on internal benchmarking for our most common experimental setup: tree genome / transcriptome, Illumina HiSeq 2500, 101bp PE sequencing. The aligner of choice varies based on the type of reference available for the project: for genome based alignment of RNA-Seq data we use STAR (Dobin et al. 2013), a MEM based aligner - it actually uses MMP (maximum mappable prefix), a variation of MEM; for alignment of RNA-Seq data to a reference transcriptome we use either bowtie (version 1, BWT FM-index based, (Langmead et al. 2009)) or the BWT FM-index or MEM implementations of BWA (H. Li and Durbin 2010), (H. Li and Durbin 2009).

5.1 Alignment to the genome

5.1.1 Indexing the genome

NOTE: The command line below is shown for exemplary purposes, NOT to be executed :-)

First, the genome needs to be indexed. This is performed using the following command:

STAR --runMode genomeGenerate --genomeDir GENOME/Indices --genomeFastaFiles \
GENOME/FASTA/genome.fa --runThreadN 8 --sjdbOverhang 100 \
--sjdbGTFfile GENOME/GTF/genome.gtf

where the genomeDir parameter specifies the output directory for the index, genomeFastaFiles specifies the genome FASTA file path and sjdbGTFfile the file path of the gene annotation file, which can typically be retrieved from EnsEMBL (in gtf format) or UCSC (in gff3 format, to be converted to gtf format). We also provide an additional option that would need to be edited depending on your sequencing read length (sjdbOverhang 100); we selected 100 as our longest reads are 101bp long - see the STAR manual for the rationale behind this.
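As a hedged aside, the maximal read length that sjdbOverhang should be derived from (max read length - 1) can be checked directly from a FASTQ file; the inline 10 bp read below is a stand-in for your real data:

```shell
# Report max(read length) - 1, a common choice for STAR's --sjdbOverhang
# (replace the printf mock with e.g. "zcat read_1.fq.gz | awk ...")
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n' | \
awk 'NR%4==2 {if (length($0) > m) m = length($0)} END {print m-1}'
```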

5.1.2 Performing the alignment

Once the genome index is built, we can align our sample reads to it. This is achieved as follows:

mkdir star
STAR --genomeDir share/Day1/data/indices/STAR --readFilesIn read-sortmerna-trimmomatic_1.fq.gz \
read-sortmerna-trimmomatic_2.fq.gz --runThreadN 4 --alignIntronMax 11000 --outSAMstrandField intronMotif \
--sjdbGTFfile share/Day1/data/reference/gff/Ptrichocarpa_v3.0_210_synthetic-gene-models-wo-introns.gff3 \
--readFilesCommand zcat --outFileNamePrefix star/read-sortmerna-trimmomatic-STAR  --outSAMmapqUnique 254 \
--outQSconversionAdd -31 --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM --outFilterMultimapNmax 100 --chimSegmentMin 1 --outWigType bedGraph \
--sjdbGTFtagExonParentTranscript Parent

where there are a number of additional parameters: --alignIntronMax is important to specify so that STAR does not try to align split reads across a distance greater than --alignIntronMax bp; reads that span an exon-exon junction (EEJ) only need to span at most the longest intron in your genome. Note that the largest intron in P. trichocarpa is about 11kb long. The parameter --outFileNamePrefix sets the path and prefix to where the results will be written (note that from now on, as the reads have been combined into a single result file, we refer to our exemplary data as “sample”). We provide a few additional parameters that may require adjustment based on your data:

Our sample files are GZip compressed so we inform STAR how to read it (readFilesCommand zcat). As our files were generated using the Illumina v1.5 FASTQ format, we convert them into Sanger FASTQ (outQSconversionAdd -31) and finally we specify that STAR should output unmapped reads separately (outReadsUnmapped Fastx). The outSAMtype BAM SortedByCoordinate ensures that the final result is already converted to the “BAM” format and sorted by coordinate, saving a few additional steps doing this via samtools. As you can see, we provide many more arguments, which you can lookup in the STAR manual; STAR is highly configurable!

5.1.3 Post-processing the alignment result files

STAR returns a number of result files:

  1. a read-sortmerna-trimmomatic-STARAligned.sortedByCoord.out.bam file that contains the alignments in BAM format, the binary version of the SAM format (Li et al, 2009).

  2. two FASTQ files containing the forward and reverse unmapped reads.

  3. a number of sample-sortmerna-trimmomatic-STARLog.* log files

  4. a number of sample-sortmerna-trimmomatic-SJ.* files containing splice junction information.

The BAM format is very efficient for computers to work with, but as a binary file format, it is unreadable to humans. If you would like to take a look at the alignments, you can do so using samtools:

samtools view star/read-sortmerna-trimmomatic-STARAligned.sortedByCoord.out.bam | head

The sorted BAM file is then indexed

samtools index star/read-sortmerna-trimmomatic-STARAligned.sortedByCoord.out.bam

Furthermore, the FASTQ files containing the unaligned reads are renamed and compressed.

The log files, which contain information relating to the processing and splice-junctions, are moved into a log directory.

mkdir star/sample-sortmerna-trimmomatic-STAR_logs
mv star/sample-sortmerna-trimmomatic-STARLog.* \
star/sample-sortmerna-trimmomatic-SJ.* \
star/sample-sortmerna-trimmomatic-STAR_logs

Among the log files, Log.final.out and SJ.out.tab are of particular interest. The first details the alignment rate, percentage of uniquely/multiply aligning reads, rate of mismatches, INDELs identified in the reads, etc. The second file describes, in a tabular format, all the EEJs identified by STAR and whether these exist in the provided gff3 file or are novel. This is an extremely useful resource that can be used to identify possible new transcript splice variants. One needs to keep in mind that transcription, like all biological processes, is stochastic and, as such, there will be mis-spliced transcripts present at a low frequency in any RNA-Seq sample that has been sequenced to adequate depth. Hence, newly identified junctions might represent low-frequency genuine transcription as well as noise.
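For instance, the annotated and novel junctions in SJ.out.tab can be tallied with awk: column 6 of that file is 1 for annotated and 0 for novel junctions. A hedged sketch - the two tab-separated rows below are hypothetical mock data standing in for a real STAR output file:

```shell
# Mock SJ.out.tab: chrom, start, end, strand, motif, annotated, unique, multi, overhang
printf 'chr1\t100\t200\t1\t1\t1\t10\t2\t30\nchr1\t500\t900\t1\t1\t0\t3\t0\t25\n' > SJ.out.tab
# Tally junctions by the "annotated" flag in column 6
awk '{n[$6]++} END {printf "annotated=%d novel=%d\n", n[1], n[0]}' SJ.out.tab
```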

6 Pseudo-alignment and read quantification with Kallisto

Kallisto is a fairly recent program that makes heavy use of k-mer hash tables and De-Bruijn graphs to “pseudo-align” reads (Bray et al. 2016). This means that Kallisto does not map a read to a given position in a genome/transcriptome, but rather to an abstract representation of a transcript. Kallisto then performs a likelihood estimation and an expectation maximization algorithm to calculate a count and - on request - bootstraps a confidence value.

Step 1 is to create the Kallisto index from a transcriptome:

kallisto index -i Ptrichocarpa_v3.0_210_synthetic-gene-models.idx \
share/Day1/data/reference/fasta/Ptrichocarpa_v3.0_210_synthetic-gene-models.fa

We could modify the k-mer size using the -k option. This is - however - often a very sensitive parameter in k-mer/hash based applications. Lower values will provide higher sensitivity but lower specificity, and vice versa for higher values of k. The optimal choice of k would be the longest length at which no mismatch is expected from natural polymorphisms or technical artefacts. In practice, this is very difficult to estimate unless you have access to a large pool of population genomics data.

Next we can quantify the input data using this newly created index:

kallisto quant -i Ptrichocarpa_v3.0_210_synthetic-gene-models.idx -b 100 \
-o . --pseudobam read-sortmerna-trimmomatic_1.fq.gz read-sortmerna-trimmomatic_2.fq.gz \
> sample-sortmerna-trimmomatic_pseudo.sam

The option -b 100 performs a bootstrap confidence estimation using 100 iterations. --pseudobam tells Kallisto to output a SAM file of pseudoalignments to stdout.

Kallisto outputs several files: abundance.tsv (the plain-text abundance estimates), abundance.h5 (an HDF5 file that additionally stores the bootstrap estimates) and run_info.json (run metadata).

The SAM file can be converted to a sorted BAM and indexed:

samtools view -bT Ptrichocarpa_v3.0_210_synthetic-gene-models.fa \
sample-sortmerna-trimmomatic_pseudo.sam | \
samtools sort -o sample-sortmerna-trimmomatic_pseudo.bam -T PREF && \
samtools index sample-sortmerna-trimmomatic_pseudo.bam

7 Visualising the alignments

Visualising alignments can prove useful to assess the data quality or the aligner’s performance. Let us take a look at the STAR results. We have already copied them to a place that JBrowse (the online genome browser we will be using, which you can access from the course webpage by clicking on your “Connect to Apache2” link) can access.

8 Prelude

8.1 Aim

The following six chapters aim at “taking a look under the hood” of how feature summarisation - i.e. creating a count table by overlapping the read alignments with the genic annotation - functions. At the same time, we will toy with computer optimisation (parallel processing and memory management). This will introduce you to an efficient and practical R usage.

Before experimenting with the Bioconductor package functionalities that were presented in the lecture, we will first consolidate the knowledge you have gathered so far by addressing the computational challenges faced when using HTS data, i.e. resource and time consumption.

In the lecture, the readGAlignments function from the GenomicAlignments package was introduced and used to extract a GAlignments object. However, most of the time an experiment will NOT consist of a single sample (of only about a million reads!), and an obvious way to speed up the process is to parallelize. In the following three sections, we will see how to perform this before ultimately discussing the pros and cons of the implemented method.

8.2 Creating GAlignment objects from BAM files

First of all, locate the BAM files and implement a function to read them sequentially. Have a look at the lapply function man page for doing so.

    library(GenomicAlignments)
    bamfiles <- dir(file.path(extdata(),"BAM"),
                    pattern="*.bam$",full.names=TRUE)
    gAlns <- lapply(bamfiles,readGAlignments)

Nothing complicated so far, right? We process both files sequentially and get a list of GAlignments objects stored in the gAlns object. Apart from the coding enhancement - with one line, we can process all our samples - there are no other gains.

8.3 Processing the files in parallel

Modern laptops commonly possess a multi-core CPU whose cores can perform tasks independently. Computational servers usually have several CPUs (commonly 8), each having several cores. An obvious enhancement to our previous solution is to take advantage of this CPU architecture and to process our samples in parallel.

Have a look at the parallel package and in particular at the mclapply function to re-implement the previous function in a parallel manner.

    library(parallel)
    gAlns <- mclapply(bamfiles,readGAlignments)

8.4 Dealing with memory problems

It is NOT because there were 2 files to process. mclapply has a number of default parameters - see ?mclapply for details - including mc.cores, which defaults to 2. If you want to process more samples in parallel, set that parameter accordingly.

This new implementation has the obvious advantage of being X times faster (X being the number of CPUs used, or almost so, as parallelization comes with a slight overhead cost), but it puts a different strain on the system. As several files are processed in parallel, the memory requirement also increases by a factor of X (assuming files of roughly equivalent size are processed). This might be fine on a computational server, but given the constant increase in sequencing reads produced per run, it will eventually become a limitation.

Can you think of a way this memory issue could be addressed? I.e. what could we modify in the way we read/process the files to limit the memory required at any given moment?

No, buying more memory is usually not an option. And anyway, at the moment, the rate of increase in reads sequenced per run is faster than the memory doubling time. So, let us just move to the next section to have a go at addressing the issue.

8.5 Processing the files one chunk at a time

To limit the memory required at any moment, one approach is to process the file not as a whole, but chunk-wise. As reads are stored independently in BAM files (or almost so; think of how Paired-End data is stored!), we can simply decide to parse, e.g., 1,000,000 reads at a time. This will of course require a new way to represent a BAM file in R, i.e. not just as a character string, as we have had it until now in our bamfiles object.

The Rsamtools package again comes in handy. Look up the ?BamFile help page and try to work out how we could take advantage of the BamFile or BamFileList classes for our purpose. As the files are small, use a yieldSize of 100,000 reads.

The yieldSize parameter of either class is exactly what we want. Let us recode our character object into a BamFileList.

    library(Rsamtools)
    bamFileList <- BamFileList(bamfiles,yieldSize=10^5)

Now that we have the BAM files described in a way that allows us to process them chunk-wise, let us do so. The paradigm is as follows. Note: it is just that - a paradigm - so DO NOT RUN IT, but rather take inspiration from it.

    ## DO NOT RUN ME!
    ## I AM JUST AN EXAMPLE
    open(bamFile)
    while(length(chunk <- readGAlignments(bamFile))){
          message(length(chunk))
    }
    close(bamFile)

In the paradigm above, we process one BAM file chunk-wise and report the sizes of the chunks, i.e. these would be 0.1M reads - in our case - apart from the last one, which would be smaller than or equal to 0.1M (it is unlikely that a sequencing file contains an exact multiple of our chunk size).

Now, try to implement the above paradigm in the function we implemented previously - remember the mclapply solution - so as to process both our BAM files in parallel and chunk-wise.

    gAlns <- mclapply(bamFileList,function(bamFile){
     open(bamFile)
     gAln <- GAlignments()
     while(length(chunk <- readGAlignments(bamFile))){
       gAln <- c(gAln,chunk)
     }
     close(bamFile)
     return(gAln)
    })

8.6 Pros and cons of the current solution

Before reading my comments below, take the time to jot down what you think are the advantages and drawbacks of the method implemented above. My own comments below are certainly not extensive and I would be curious to hear yours that are not matched with mine.

8.6.1 Pros

  1. We have written a streamlined piece of code, using up to date functionalities from other packages. Hence, it is both easily maintainable and updatable.

  2. With regard to time consumption, we have reduced it by a factor of 2, and it can be reduced further by using a computer with more CPUs, or even a compute farm - obviously only if we have more than 2 samples to process.

  3. We have implemented the processing of the BAM files by chunks.

8.6.2 Cons

  1. There’s only one big con really: we have NOT addressed the memory requirement issue satisfactorily. We do process the BAM files by chunks, but then we simply aggregate these chunks without further processing, so we eventually end up using the same amount of memory. This is the best we can do so far given the introduced Bioconductor functionalities, so let us move to the next step in the pipeline, which will help us resolve that. But first we should recap the usage of the Bioconductor packages for loading and manipulating sequencing read information in R, which is the next chapter’s topic.

9 Short Read Alignments

In the previous chapter, we had an initial look at computer optimisation. In the present chapter, we will look into how read alignments are commonly manipulated and what principles are followed (and caveats faced) by standard tools.

Most down-stream analysis of short read sequences is based on reads aligned to reference genomes. There are many aligners available, including BWA (H. Li and Durbin 2010),(H. Li and Durbin 2009), Bowtie2 (Langmead et al. 2009), GSNAP (Wu and Watanabe 2005), STAR (Dobin et al. 2013); merits of which are discussed in the literature. There are also alignment algorithms implemented in Bioconductor (e.g., matchPDict in the Biostrings package and the gmapR, Rbowtie, Rsubread packages); matchPDict is particularly useful for flexible alignment of moderately sized subsets of data.

9.1 Alignments and Bioconductor packages

The following sections introduce core tools for working with high-throughput sequence data; key packages for representing reads and alignments are summarized in the following table.

Package Description
ShortRead In addition to functionalities to manipulate raw read files (e.g. the ShortReadQ class and functions for manipulating fastq files), this package offers the possibility to load numerous HTS format classes. These are mostly sequencer manufacturer specific (e.g. sff for 454) or pre-BAM aligner proprietary formats (e.g. MAQ or bowtie). These functionalities rely heavily on Biostrings and somewhat on Rsamtools.
GenomicAlignments GAlignments and GAlignmentPairs store single- and paired-end aligned reads and extend on the GenomicRanges functionalities.
Rsamtools Provides access to BAM alignment and other large sequence-related files.
rtracklayer Input and output of bed, wig and similar files.

Read the man page of the GAlignments and GAlignmentPairs classes and pay attention to the very important comments on multi-reads and paired-end processing.

Really just ?GAlignments. However, KEEP these details in mind, as they are essential and a likely source of erroneous conclusions. Remember the example from this morning’s lecture about RNA editing.

9.1.1 Alignments and the ShortRead package

Earlier, you looked at the ShortRead package to manipulate raw reads and to perform Quality Assessment (QA) on raw data files, e.g., fastq formatted files. These are not the only functionalities of the ShortRead package, which also offers the possibility to read in alignment files in many different formats.

Two files of the pasilla dataset have been aligned using bowtie (???); locate them in the extdata folder of the RnaSeqTutorial package.

    bwtFiles <- dir(path=file.path(extdata(),"bowtie"),
                   pattern="*.bwt$",full.names=TRUE)

Have a peek at one of the files and try to decipher its format. Hint: it is a tab-delimited format, so check the read.delim function. As you may not want to read all the lines just to get an idea, look up an appropriate argument for that.

You might want to consult the bowtie manual to check whether your guesses were correct. Here is how to read 10 lines of the first file.

    read.delim(file=file.path(extdata(),"bowtie","SRR074430.bwt"),
              header=FALSE,nrows=10)

Now, as presented in the lecture, use the readAligned function to read in the bowtie alignment files.

    alignedRead <- readAligned(dirPath=file.path(extdata(),"bowtie"),
                       pattern="*.bwt$",type="Bowtie")

What is peculiar about the returned object? Determine its class. Can you tell where the data from both input files are?

We obtained a single object of the AlignedRead class. By looking at the documentation (i.e., the Value section of ?readAligned), we are told that all files are concatenated into a single object, with NO guarantee of the order in which the files are read. This is convenient when we want to merge several sequencing runs of the same sample, but we need to be cautious and process independent samples by calling the readAligned function individually for every sample.
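To keep samples separate, one possible approach (a sketch, not part of the tutorial code; the list name below is made up) is to call readAligned once per file and collect the results in a named list, using the bwtFiles vector built earlier:

```r
# Sketch (assumption: one sample per file): call readAligned individually
# for each bowtie output file, so samples are never concatenated
library(ShortRead)
alignedReadList <- lapply(bwtFiles, function(f) {
  readAligned(dirPath=dirname(f), pattern=basename(f), type="Bowtie")
})
names(alignedReadList) <- basename(bwtFiles)
```

Each element of the list can then be processed on its own, e.g. with sapply or lapply.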

Finally, taking another look at the lecture, select only the reads that align to chromosome 2L. Hint: use the appropriate SRFilter filter.

    alignedRead2L <- readAligned(dirPath=file.path(extdata(),"bowtie"),
                                pattern="*.bwt$",type="Bowtie",
                       filter=chromosomeFilter("2L"))

This concludes the overview of the ShortRead package. As the BAM format has become a de-facto standard, it is unlikely that you will end up using this package to process reads in R; you are more likely to use the Rsamtools package, presented next.

9.1.2 Alignments and the Rsamtools package

9.1.2.1 Alignment formats

Most main-stream aligners produce output in SAM (text-based) or BAM format. A SAM file is a text file, with one line per aligned read, and fields separated by tabs. Here is an example of a single SAM line, split into fields.

    fl <- system.file("extdata", "ex1.sam", package="Rsamtools")
    strsplit(readLines(fl, 1), "\t")[[1]]
##  [1] "B7_591:4:96:693:509"                 
##  [2] "73"                                  
##  [3] "seq1"                                
##  [4] "1"                                   
##  [5] "99"                                  
##  [6] "36M"                                 
##  [7] "*"                                   
##  [8] "0"                                   
##  [9] "0"                                   
## [10] "CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG"
## [11] "<<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7"
## [12] "MF:i:18"                             
## [13] "Aq:i:73"                             
## [14] "NM:i:0"                              
## [15] "UQ:i:0"                              
## [16] "H0:i:1"                              
## [17] "H1:i:0"

Fields in a SAM file are summarized in the following table:

Field Name Value
1 QNAME Query (read) NAME
2 FLAG Bitwise FLAG, e.g., strand of alignment
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition of sequence
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR Extended CIGAR string
7 MRNM Mate Reference sequence NaMe
8 MPOS 1-based Mate POSition
9 ISIZE Inferred insert SIZE
10 SEQ Query SEQuence on the reference strand
11 QUAL Query QUALity
12+ OPT OPTional fields, format TAG:VTYPE:VALUE

We recognize, from the FASTQ format introduced earlier, the identifier string, the read sequence and the qualities. The alignment is to a chromosome ‘seq1’ starting at position 1. The strand of alignment is encoded in the ‘flag’ field. The alignment record also includes a measure of mapping quality, and a CIGAR string describing the nature of the alignment. In this case, the CIGAR is 36M, indicating that the alignment consists of 36 Matches or mismatches, with no indels or gaps; indels are represented by I and D; gaps (e.g., from alignments spanning introns) by N.
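As an aside, the bitwise FLAG can be decoded with base R alone. decodeFlag below is a hypothetical helper (not a package function); the bit values come from the SAM specification. For the record above, flag 73 = 1 + 8 + 64: the read is paired, its mate is unmapped, and it is the first read of the pair.

```r
# Decode a SAM FLAG value into its set bits (base R only);
# bit meanings follow the SAM specification
decodeFlag <- function(flag) {
  bits <- c(paired = 1L, properPair = 2L, unmapped = 4L, mateUnmapped = 8L,
            minusStrand = 16L, mateMinusStrand = 32L, firstInPair = 64L,
            secondInPair = 128L, secondary = 256L, qcFail = 512L,
            duplicate = 1024L)
  names(bits)[bitwAnd(flag, bits) != 0]
}
decodeFlag(73)  # 73 = 1 + 8 + 64
## [1] "paired"       "mateUnmapped" "firstInPair"
```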

BAM files encode the same information as SAM files, but in a format that is more efficiently parsed by software; BAM files are the primary way in which aligned reads are imported into R.

9.1.2.2 Aligned reads in R

As introduced in the previous section, there are three different packages to read alignments in R:

  • ShortRead

  • GenomicAlignments

  • Rsamtools

The last two will be described in more detail in the next paragraphs.

9.1.2.2.1 GenomicAlignments

The readGAlignments function from the GenomicAlignments package reads essential information from a BAM file into an instance of the GAlignments class. The GAlignments class has been designed to allow useful manipulation of many reads (e.g., 20 million) under moderate memory requirements (e.g., 4 GB).

Use readGAlignments to read in the “ex1.bam” file that can be found in the “extdata” folder of the Rsamtools package.

    alnFile <- system.file("extdata", "ex1.bam", package="Rsamtools")
    aln <- readGAlignments(alnFile)
    head(aln, 3)

The readGAlignments function takes an additional argument, param, that allows you to restrict the information retrieved from a BAM file, e.g., to retrieve only the alignments overlapping known gene coordinates.
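For instance, a ScanBamParam object with a which argument restricts the import to a genomic region. This is a sketch only: the region coordinates below are illustrative, not taken from the tutorial data.

```r
# Sketch: import only the alignments overlapping an (illustrative)
# region of 'seq1', instead of the whole file
library(GenomicAlignments)
param <- ScanBamParam(which=GRanges("seq1", IRanges(1, 1000)))
aln.seq1 <- readGAlignments(alnFile, param=param)
```

This is how the "BAM file as a database" idea introduced below works in practice: only the selected slice of the file is ever loaded into memory.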

A GAlignments instance is like a data frame, but with accessors as suggested by the column names. It is easy to query, e.g., the distribution of reads aligning to each strand, the length of reads (a.k.a. width), or the alignment CIGAR strings.

Summarize the strand, width and CIGAR information from the aln object.

    table(strand(aln))
## 
##    +    -    * 
## 1647 1624    0
    table(width(aln))
## 
##   30   31   32   33   34   35   36   38   40 
##    2   21    1    8   37 2804  285    1  112
    head(sort(table(cigar(aln)), decreasing=TRUE))
## 
##      35M      36M      40M      34M      33M 14M4I17M 
##     2804      283      112       37        6        4
9.1.2.2.2 Rsamtools

The readGAlignments function of the GenomicAlignments package only parses some of the fields of a BAM file, and that may not be appropriate for every use case. In such cases, the scanBam function in Rsamtools provides greater flexibility. The idea is to view BAM files as a kind of database. Particular regions of interest can be selected, and the information in the selection restricted to particular fields. These operations are determined by the values of a ScanBamParam object, passed as the named param argument to scanBam.

Consult the help page for ScanBamParam, and construct an object that restricts the information returned by a scanBam query to the aligned read DNA sequence. Your solution will use the what parameter of the ScanBamParam function.

Use the ScanBamParam object to query a BAM file, and calculate the GC content - using the gcFunction function - of all aligned reads. Summarize the GC content as a histogram:

    param <- ScanBamParam(what="seq")
    seqs <- scanBam(bamfiles[[1]], param=param)
    readGC <- gcFunction(seqs[[1]][["seq"]])
    hist(readGC)
9.1.2.2.3 Advanced Rsamtools usage

The Rsamtools package has more advanced functionalities:

  1. functions to count, index, filter, sort BAM files

  2. functions to access the header only

  3. the possibility to access SAM attributes (tags)

  4. functions to manipulate the CIGAR string

  5. the possibility to combine BAM libraries into a study set (BamViews)

Find the function that permits scanning the BAM header, and retrieve the header of the first BAM file in the bam subfolder of bigdata().

It contains the reference sequence names and lengths, as well as the name, version and command line of the tool used to perform the alignments.

    scanBamHeader(bamfiles[1])

The SAM/BAM format contains a tag, “NH”, that gives the total number of valid alignments reported for a read. How can you extract that information from the same first BAM file and plot it as a histogram?

    param <- ScanBamParam(tag="NH")
    nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
    hist(nhs)

So it seems a majority of our reads have multiple alignments! Some processing might be required to deal with these; i.e., if reads were aligned to the transcriptome, there exist tools that can deconvolute transcript-specific expression, for example MMSEQ (Turro et al. 2011) and BitSeq (Glaus et al. 2012), the latter also existing as an R package: BitSeq. Otherwise, if reads were aligned to the genome, one should consider filtering these multiple alignments to avoid introducing artifactual noise in the subsequent analyses.
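To quantify the problem, one can simply summarise the NH values. The snippet below simulates an NH vector in plain R (the real nhs comes from the scanBam call above; the simulated values and proportions are invented for illustration). As a further hedged pointer, ScanBamParam also accepts a tagFilter argument (e.g., tagFilter=list(NH=1)) that can be used to import only uniquely aligned reads.

```r
# Simulated NH tag values, standing in for the real 'nhs' extracted above
set.seed(42)
nhs <- sample(c(1L, 2L, 5L, 10L), 1000, replace=TRUE,
              prob=c(0.3, 0.3, 0.2, 0.2))
# proportion of reads with more than one reported alignment
multi.frac <- mean(nhs > 1)
multi.frac
# unique (TRUE) vs multi-mapping (FALSE) reads
table(nhs == 1)
```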

The CIGAR string contains interesting information, in particular whether or not a given match contains indels. Using the first BAM file, can you get a matrix of all seven CIGAR operations? And find out the intron size distribution?

    param <- ScanBamParam(what="cigar")
    cigars <- scanBam(bamfiles[[1]], param=param)[[1]]$cigar
    cigar.matrix <- cigarOpTable(cigars)
    intron.size <- cigar.matrix[,"N"]
    intron.size[intron.size>0]
    plot(density(intron.size[intron.size>0]))
    hist(log10(intron.size[intron.size>0]),xlab="intron size (log10 bp)")

Look up the documentation for BamViews and, using the leeBamViews package, create a BamViews instance. Afterwards, use some of the accessors of that object to retrieve, e.g., the file paths or the sample names.

    library(leeBamViews)
    bpaths <- dir(system.file("bam", package="leeBamViews"), full=TRUE, patt="bam$")
    gt <- do.call(rbind, strsplit(basename(bpaths), "_"))[,1]
    geno <- substr(gt, 1, nchar(gt)-1)
    lane <- substr(gt, nchar(gt), nchar(gt))
    pd <- DataFrame(geno=geno, lane=lane, row.names=paste(geno, lane, sep="."))
    bs1 <- BamViews(bamPaths=bpaths, bamSamples=pd,
                    bamExperiment=list(annotation="org.Sc.sgd.db"))
    bamPaths(bs1)
    bamSamples(bs1)

Finally, extract the coverage for the locus 861250:863000 on chromosome “Scchr13” for every sample in the object.

    sel <- GRanges(seqnames = "Scchr13", 
                   IRanges(start = 861250, end = 863000),
                   strand="+")
    covex = RleList(lapply(bamPaths(bs1),
                           function(x){
                             coverage(readGAlignments(x))[sel][["Scchr13"]]}))

This offers an interesting way to process multiple samples at the same time when you are interested in a particular locus.
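As an aside, the coverage above is stored run-length encoded (the Rle in RleList), which is why whole-chromosome coverage fits comfortably in memory; base R's rle function illustrates the idea on a toy per-base coverage vector.

```r
# Toy per-base coverage over 9 positions: long runs of identical values
# compress to (value, length) pairs
cov <- c(0, 0, 1, 1, 1, 2, 2, 1, 0)
r <- rle(cov)
r$values   # distinct coverage levels
r$lengths  # number of consecutive bases at each level
```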

9.1.3 Resources

There are extensive vignettes for the Rsamtools and GenomicAlignments packages.

10 Annotation of Genes and Genomes

In the previous chapter, we saw how aligned reads can be manipulated and what the crucial caveats to be aware of are, i.e., stranded vs. non-stranded data and the discarding of multi-mapping reads. In the present chapter, we will look at how genic annotations are retrieved and processed. Here again we will introduce how this is commonly done and the caveats this implies.

Bioconductor provides extensive annotation resources, summarized in the following figure. These can be gene-, or genome-centric. Annotations can be provided in packages curated by Bioconductor, or obtained from web-based resources.
Gene-centric AnnotationDbi packages include:

Examples of genome-centric packages include:

Web-based resources include

Annotation Packages: the big picture

Here, we will focus on genome-centric approaches, using the GenomicFeatures package.

10.1 Genome-centric annotations with GenomicFeatures

Genome-centric packages are very useful for annotations involving genomic coordinates. It is straightforward, for instance, to discover the coordinates of coding sequences in regions of interest, and from these retrieve the corresponding DNA or protein coding sequences. Other examples of the types of operations that are easy to perform with genome-centric annotations include defining regions of interest for counting aligned reads in RNA-seq experiments and retrieving DNA sequences underlying regions of interest in ChIP-seq analysis, e.g., for motif characterization.

  • Load the ‘transcript.db’ package relevant to the dm3 build of D. melanogaster.

  • Use select and friends to select the Flybase gene ids of the top table and the Flybase transcript names (TXNAME) and Entrez gene identifiers (GENEID).

  • Use cdsBy to extract all coding sequences, grouped by transcript.

  • Subset the coding sequences to contain just the transcripts relevant to the top table.

  • How many transcripts are there?

  • What is the structure of the first transcript’s coding sequence?

  • Load the ‘BSgenome’ package for the dm3 build of D. melanogaster.

  • Use the coding sequences ranges of the previous part of this exercise to extract the underlying DNA sequence, using the extractTranscriptSeqs function.

  • Use Biostrings’s translate function to convert DNA to amino acid sequences.

The following loads the relevant Transcript.db package and creates a more convenient alias for the TranscriptDb instance defined in the package.

    library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
    txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene

We can discover the available keys (using keys) and columns (columns) in txdb, and then use select to retrieve the transcripts associated with each differentially expressed gene. The mapping between gene and transcript is not one-to-one: some genes have more than one transcript, which is biologically relevant, and some may have none, which might be due to different database versions being used or to database cross-reference (xref) inconsistencies.

    set.seed(123)
    fbids <- sample(keys(txdb),10,FALSE)
    txnm <- select(txdb, fbids, "TXNAME", "GENEID")
## 'select()' returned 1:many mapping between keys and columns
    nrow(txnm)
## [1] 17
    head(txnm, 3)
##        GENEID      TXNAME
## 1 FBgn0031701 FBtr0307036
## 2 FBgn0031701 FBtr0079073
## 3 FBgn0053819 FBtr0091823

The TranscriptDb instances can be queried for data that is more structured than simple data frames; in particular, they return GRanges or GRangesList instances to represent genomic coordinates. These queries are performed using cdsBy (coding sequences), transcriptsBy (transcripts), etc., where the function argument by specifies how coding sequences or transcripts are grouped. Here we extract the coding sequences grouped by transcript, returning the transcript names, and subset the resulting GRangesList to contain just the transcripts of interest to us. The sixth transcript is composed of 8 distinct coding sequence regions.

    cds <- cdsBy(txdb, "tx", use.names=TRUE)[txnm$TXNAME[6]]
    cds[1]
## GRangesList object of length 1:
## $FBtr0073547 
## GRanges object with 8 ranges and 3 metadata columns:
##       seqnames               ranges strand |    cds_id    cds_name
##          <Rle>            <IRanges>  <Rle> | <integer> <character>
##   [1]     chrX [11332696, 11332927]      - |     59110        <NA>
##   [2]     chrX [11331237, 11332272]      - |     59109        <NA>
##   [3]     chrX [11331056, 11331170]      - |     59108        <NA>
##   [4]     chrX [11330808, 11330825]      - |     59107        <NA>
##   [5]     chrX [11329762, 11330429]      - |     59106        <NA>
##   [6]     chrX [11329582, 11329699]      - |     59105        <NA>
##   [7]     chrX [11329416, 11329514]      - |     59104        <NA>
##   [8]     chrX [11327313, 11327450]      - |     59102        <NA>
##       exon_rank
##       <integer>
##   [1]         2
##   [2]         3
##   [3]         4
##   [4]         5
##   [5]         6
##   [6]         7
##   [7]         8
##   [8]         9
## 
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome

The following code loads the appropriate BSgenome package; the Dmelanogaster object it defines represents the whole genome sequence. The remaining steps extract the DNA sequence of each transcript and translate it into an amino acid sequence. Issues of strand are handled correctly.

    library(BSgenome.Dmelanogaster.UCSC.dm3)
    txx <- extractTranscriptSeqs(Dmelanogaster, cds)
    length(txx)
## [1] 1
    head(txx, 3)
##   A DNAStringSet instance of length 1
##     width seq                                          names               
## [1]  2424 ATGACCCTGACCACAACGACC...GTACGGAGCTCTCCACATAA FBtr0073547
    head(translate(txx), 3)
##   A AAStringSet instance of length 1
##     width seq                                          names               
## [1]   808 MTLTTTTTASSAESQAKMDVK...EIDFNDLNMVRRGSTELST* FBtr0073547

10.2 Creating a synthetic set of transcripts

One major caveat when estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, can be aligned to several features (transcripts or genes) if those alignments are of equivalent quality. This happens as a result of gene duplication and the presence of repetitive or common domains, for example. To avoid this, it is best practice to adopt a conservative approach by collapsing all existing transcripts of a single gene locus into a “synthetic” transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e., an artificial exon is created. This process results in a flattened transcript-gene structure with a one-to-one relationship. As this procedure varies from organism to organism, there is, to the best of our knowledge, no tool available for performing this step.
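The collapsing idea can be sketched in a few lines of plain R. flattenExons below is a hypothetical helper written for illustration only; the actual processing later in this section relies on IRanges::reduce, which performs the same merge for us.

```r
# Merge overlapping (or book-ended) exon intervals from all transcripts of a
# gene into non-overlapping "synthetic" exons; returns a start/end matrix
flattenExons <- function(starts, ends) {
  o <- order(starts)
  starts <- starts[o]; ends <- ends[o]
  out <- list(c(starts[1], ends[1]))
  for (i in seq_along(starts)[-1]) {
    last <- out[[length(out)]]
    if (starts[i] <= last[2] + 1) {
      # overlap or adjacency: extend the current synthetic exon
      out[[length(out)]][2] <- max(last[2], ends[i])
    } else {
      # gap: start a new synthetic exon
      out[[length(out) + 1]] <- c(starts[i], ends[i])
    }
  }
  do.call(rbind, out)
}
# two transcripts whose first exons overlap: three exons collapse to two
flattenExons(starts=c(100, 150, 400), ends=c(200, 250, 500))
##      [,1] [,2]
## [1,]  100  250
## [2,]  400  500
```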

Here, we will take our inspiration (ha-ha) from the documentation of the R/Bioconductor easyRNASeq package (Delhomme et al. 2012); see paragraph 7.1 of the package vignette.

First, we load the necessary libraries

library(IRanges)
library(genomeIntervals)

Next, we read in the GFF3 file

gff <- readGff3(file.path(extdata(),
                          "GFF3/Ptrichocarpa_v3.0_210_gene_exons.gff3.gz"),
                quiet=TRUE)

And have a look at the Genome_Intervals object content

gff
## Object of class Genome_intervals_stranded
## 1150406 base intervals and 0 inter-base intervals(*):
## Chr01 - [1660, 2502] 
## Chr01 - [1660, 2502] 
## Chr01 - [1660, 2502] 
## Chr01 - [1660, 2502] 
## ... 1150398 other intervals...
## scaffold_991 + [13343, 13510] 
## scaffold_991 + [13343, 13510] 
## scaffold_991 + [13881, 14048] 
## scaffold_991 + [13881, 14048] 
## 
## annotation:
##   seq_name strand inter_base       source type score phase
## 1    Chr01      -      FALSE phytozome9_0 gene    NA    NA
## 2    Chr01      -      FALSE phytozome9_0 mRNA    NA    NA
## 3    Chr01      -      FALSE phytozome9_0 exon    NA    NA
## 4    Chr01      -      FALSE phytozome9_0  CDS    NA     0
##                                                                              gffAttributes
## 1                                                ID=Potri.001G000100;Name=Potri.001G000100
## 2 ID=PAC:27043735;Name=Potri.001G000100.1;pacid=27043735;longest=1;Parent=Potri.001G000100
## 3                                ID=PAC:27043735.exon.1;Parent=PAC:27043735;pacid=27043735
## 4                                 ID=PAC:27043735.CDS.1;Parent=PAC:27043735;pacid=27043735
## ... 1150398 other intervals...
##             seq_name strand inter_base       source type score phase
## 1150403 scaffold_991      +      FALSE phytozome9_0 exon    NA    NA
## 1150404 scaffold_991      +      FALSE phytozome9_0  CDS    NA     0
## 1150405 scaffold_991      +      FALSE phytozome9_0 exon    NA    NA
## 1150406 scaffold_991      +      FALSE phytozome9_0  CDS    NA     0
##                                                     gffAttributes
## 1150403 ID=PAC:26993235.exon.4;Parent=PAC:26993235;pacid=26993235
## 1150404  ID=PAC:26993235.CDS.4;Parent=PAC:26993235;pacid=26993235
## 1150405 ID=PAC:26993235.exon.5;Parent=PAC:26993235;pacid=26993235
## 1150406  ID=PAC:26993235.CDS.5;Parent=PAC:26993235;pacid=26993235
nrow(gff[gff$type=="exon",])
## [1] 460072
nrow(gff[gff$type=="mRNA",])
## [1] 73013
nrow(gff[gff$type=="gene",])
## [1] 41335

Here, we do not ask you to understand every detail of the implementation, but rather to get an idea of the process.

First, we identify the ID and Parent of every mRNA feature and create a mapping table from them.

sel <- gff$type == "mRNA"
transcriptGeneMapping <- data.frame(getGffAttribute(gff[sel], "ID"), 
                                    getGffAttribute(gff[sel], "Parent")
)
head(transcriptGeneMapping)
##             ID           Parent
## 1 PAC:27043735 Potri.001G000100
## 2 PAC:27045395 Potri.001G000200
## 3 PAC:27045442 Potri.001G000300
## 4 PAC:27046907 Potri.001G000400
## 5 PAC:27046908 Potri.001G000400
## 6 PAC:27046909 Potri.001G000400

Next, we select the exon features and group them by their Parent gene.

sel <- gff$type == "exon"
rngList <- split(IRanges(start=gff[sel,1], end=gff[sel,2]),
                 transcriptGeneMapping[match(
                   sapply(strsplit(getGffAttribute(gff[sel,],"Parent"),","),"[",1),
                   transcriptGeneMapping$ID),"Parent"])
rngList
## IRangesList of length 41335
## $Potri.001G000100
## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]      1660      2502       843
## 
## $Potri.001G000200
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]      6501      6646       146
##   [2]      3506      3928       423
##   [3]      2906      3475       570
## 
## $Potri.001G000300
## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]      8391      8860       470
## 
## ...
## <41332 more elements>
mostExons <- rev(names(table(elementNROWS(rngList))))[1]
mostExons
## [1] "659"

Then, we reduce the exon structure

rngList<- IRanges::reduce(rngList)
rngList
## IRangesList of length 41335
## $Potri.001G000100
## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]      1660      2502       843
## 
## $Potri.001G000200
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]      2906      3475       570
##   [2]      3506      3928       423
##   [3]      6501      6646       146
## 
## $Potri.001G000300
## IRanges object with 1 range and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]      8391      8860       470
## 
## ...
## <41332 more elements>
rev(names(table(elementNROWS(rngList))))[1]
## [1] "77"

Once we have reduced the transcript complexity, we can reconstruct a GFF structure from it.

First, we get the gff information; here we simply duplicate the first exon of every gene as many times as there are synthetic exons for that gene. The content will be updated afterwards.

exons <- gff[sel,]
syntheticGeneModel<- exons[rep(
  match(names(rngList),
        transcriptGeneMapping[
          match(sapply(strsplit(getGffAttribute(exons,"Parent"),","),"[",1),
                transcriptGeneMapping$ID),"Parent"]),
  elementNROWS(rngList)),]

Then, we update the coordinates and change the source. This latter step documents our changes, i.e., it makes it obvious to anyone using the newly generated annotations that they are not the original ones.

syntheticGeneModel[,1]<- unlist(start(rngList))
syntheticGeneModel[,2]<- unlist(end(rngList))
levels(syntheticGeneModel$source)<- "inhouse"

Next, we get the exon number for both strands. As the exons on the minus strand are already correctly ordered by our first command (i.e., in decreasing order), we only have to reverse those on the plus strand.

exonNumber<- lapply(elementNROWS(rngList),":",1)
sel<- strand(syntheticGeneModel)[cumsum(elementNROWS(rngList))] == "+"
exonNumber[sel]<- sapply(exonNumber[sel],rev)

This is followed by updating the attributes (the ninth column).

syntheticGeneModel$gffAttributes<- paste("ID=",
                                         rep(names(rngList),elementNROWS(rngList)),
                                         ":",unlist(exonNumber),";Parent=",
                                         rep(names(rngList),elementNROWS(rngList)),".0",sep="")

Finally, we write the newly generated annotation out in different formats.

writeGff3(syntheticGeneModel,file="~/Ptrichocarpa_v3.0_210_synthetic_transcripts.gff3")

sel <- syntheticGeneModel$type=="exon"
annot <- split(GRanges(seqnames=seq_name(syntheticGeneModel[sel]),
                       ranges=IRanges(start=syntheticGeneModel[sel,1],
                                      end=syntheticGeneModel[sel,2]),
                       strand=strand(syntheticGeneModel[sel])),
               getGffAttribute(syntheticGeneModel,"Parent")[sel,1]
)
## Warning: 'seq_name' is deprecated.
## Use 'seqnames' instead.
## See help("Deprecated")
save(annot,file="~/Ptrichocarpa_v3.0_210_synthetic_transcripts.rda")

10.2.1 Alternative

Note: the same can actually be achieved with a single call to the easyRNASeq function createSyntheticTranscripts:

synthTrx <- createSyntheticTranscripts(
    file.path(extdata(),"GFF3/Ptrichocarpa_v3.0_210_gene_exons.gff3.gz"),
    verbose=FALSE)

11 Interlude

In this chapter, we will briefly return to computational optimisation.

    library(RnaSeqTutorial)
    bamfiles <- dir(file.path(extdata(),"BAM"),
                   pattern="*.bam$",full.names=TRUE)
    bamFileList <- BamFileList(bamfiles,yieldSize=10^6)

As you have been introduced to the GenomicAlignments functionalities to find, count or summarize overlaps between reads and annotations, we can refine our preferred function. We had left it as:

    gAlns <- mclapply(bamFileList,function(bamFile){
     open(bamFile)
     gAln <- GAlignments()
     while(length(chunk <- readGAlignments(bamFile))){
       gAln <- c(gAln,chunk)
     }
     close(bamFile)
     return(gAln)
    })

Using our synthetic transcript annotation, Ptrichocarpa_v3.0_210_synthetic_transcripts.rda, implement counting by chunks.

    data("Ptrichocarpa_v3.0_210_synthetic_transcripts")
    count.list <- mclapply(bamFileList,function(bamFile,annot){
    open(bamFile)
    counts <- vector(mode="integer",length=length(annot))
    while(length(chunk <- readGAlignments(bamFile))){
       counts <- counts + assays(summarizeOverlaps(annot,
                                                   chunk,
                                                   mode="Union",
                                                   ignore.strand=TRUE))$counts
    }
    close(bamFile)
    return(counts)
    },syntheticTrx[syntheticTrx$type == "exon"])

This gives us a list of counts per sample; to get a count matrix, do:

    count.table <- do.call("cbind",count.list)
    head(count.table[rowSums(count.table)>0,])
##      reads reads
## [1,]     2     3
## [2,]     3     2
## [3,]     2     0
## [4,]    12     5
## [5,]     2     1
## [6,]     0     3

Such a count table is the minimal input that downstream analysis software, e.g., DESeq2 or edgeR, uses.
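As a hedged illustration of that hand-over, a DESeq2 analysis would start from such a matrix roughly as follows. Everything below is invented for the sketch: the counts are simulated (real designs need the replicated conditions shown here, which our two-sample table above lacks), and the sample/condition names are made up.

```r
library(DESeq2)
# simulated count matrix standing in for a real 'count.table'
# (4 samples, 2 conditions with 2 replicates each)
set.seed(1)
cts <- matrix(rnbinom(4000, mu=10, size=1), ncol=4,
              dimnames=list(NULL, paste0("sample", 1:4)))
colData <- data.frame(condition=factor(c("ctrl", "ctrl", "trt", "trt")))
dds <- DESeqDataSetFromMatrix(countData=cts, colData=colData,
                              design=~condition)
dds <- DESeq(dds)
head(results(dds))
```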

A function similar to this is probably all you'll need to process your reads and get a count table from a standard Illumina-based RNA-Seq experiment. However, you might want more flexibility for your projects, and Bioconductor certainly offers the possibility to achieve that; examples are given in the next chapter.

In this chapter, now that we are familiar with what feature summarisation means, we will run one of the de-facto standard tools, namely HTSeq.

12 Count table creation

The read alignment step performed previously concluded the data pre-processing common to the majority of RNA-Seq based experiments. The following table details the typical numbers of read sequences remaining after the data filtering and alignment steps. There is subsequently a large number of choices for performing downstream analyses of mRNA-Seq data. Probably the most common are to identify differential expression between conditions, or to analyse sequence variants (Single Nucleotide Polymorphisms (SNPs), INsertions/DELetions (INDELs), Copy Number Variants (CNVs)). However, some of these analyses, DE analysis for example - the topic of the remainder of this tutorial - require additional data preparation.

Step Input Data Usable reads % of total % removed from previous step
Raw raw reads 1,000,000 100 0
SortMeRNA raw reads 970,000 - 990,000 97 - 99 1 - 3
Trimmomatic filtered / raw reads 776,000 - 891,000 78 - 89 10 - 20
Aligner (STAR) trimmed / filtered / raw reads 620,800 - 801,900 62 - 81 10 - 20
Analysis aligned reads 620,800 - 801,900 62 - 81 0

This data preparation varies depending on whether expression at the gene or the transcript level is required. In our case, we are interested in gene expression, as we would like to perform a differential gene expression study.

12.1 Data preparation for a DE analysis at the gene level

A typical DE analysis data preparation consists of three steps: first, generating a non-redundant annotation; then quantifying/summarising the pre-processed reads aligned to each such feature; and finally performing a QC that assesses whether the observed effects may have biological causes.

12.1.1 Creating a non-redundant annotation

One major caveat when estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, can be aligned to several features (transcripts or genes) if those alignments are of equivalent quality. This happens as a result of gene duplication and the presence of repetitive or common domains, for example. To avoid this, it is best practice to adopt a conservative approach by collapsing all existing transcripts of a single gene locus into a “synthetic” transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e., an artificial exon is created. This process results in a flattened transcript-gene structure with a one-to-one relationship. As this procedure varies from organism to organism, there is, to the best of our knowledge, no tool available for performing this step. The documentation of the R/Bioconductor easyRNASeq package (Delhomme et al. 2012) - see paragraph 7.1 of the package vignette - details a way of doing this in R, starting from a GTF/GFF3 annotation file. From the “genome.gff3” that was used during the alignment step, we obtain a synthetic-transcript.gff3 file.

12.1.2 Counting reads per feature

The second step is to perform the intersection between the aligned positions of reads (contained in the alignment BAM file) and the gene coordinates obtained in the previous step, to count the number of reads overlapping a gene. There are two primary caveats here: first, the annotation collapsing process detailed above works on a gene-by-gene basis and hence is oblivious to the existence of genes that may overlap another gene encoded on the opposite strand; second, aligners may return multiple mapping positions for a single read. In the absence of a more adequate solution - see the next section on “DE analysis at the transcript level” for an example of what may be done - it is best to ignore multi-mapping reads.

A de-facto standard for counting is the htseq-count tool supplied as part of the HTSeq python library (Anders, Pyl, and Huber 2014). The associated webpage illustrates the issues discussed above in greater detail. For non-strand-specific reads we suggest running htseq-count as follows:

htseq-count -f bam -r pos -m union -s no -t exon -i Parent \
sample-sortmerna-trimmomatic-STAR.bam synthetic-transcript.gff3 > \
sample-sortmerna-trimmomatic-STAR-HTSeq.txt

whereas for stranded data we advise using the following:

htseq-count -f bam -r pos -m intersection-nonempty -s reverse -t exon \
-i Parent sample-sortmerna-trimmomatic-STAR.bam synthetic-transcript.gff3 > \
sample-sortmerna-trimmomatic-STAR-HTSeq.txt

Using your STAR-aligned BAM file and the gff3 file available at ~/share/Day1/data/reference/gff/Ptrichocarpa_v3.0_210_synthetic-gene-models-wo-introns.gff3, use the HTSeq htseq-count utility to create the count table for your sample (our data is not strand-specific).

Before you proceed further, read about the count modes in HTSeq and think about why we choose the union mode.

This last chapter builds on what we have done in the five previous chapters and extends the use of the R/Bioconductor packages, which offer greater flexibility than tools such as HTSeq. Very few tools address the caveats we mentioned in the previous chapters, and this is still a field under very active development. The aim of this chapter is to give you some extra insight, i.e. pointers, in case your research is affected by one of the caveats mentioned (or others).

13 Estimating Expression over Genes and Exons

    library(RnaSeqTutorial)
    bamfiles <- dir(file.path(extdata(),"BAM"),
                   pattern="*.bam$",full.names=TRUE)

This chapter describes part of an RNA-Seq analysis use-case. RNA-Seq (Mortazavi et al. 2008) was introduced as a new method to perform Gene Expression Analysis, taking advantage of the high throughput of Next-Generation Sequencing (NGS) machines.

13.1 Counting reads over known genes and exons

The goal of this use-case is to generate a count table for the selected genic features of interest, i.e. exons, transcripts, gene models, etc.

To achieve this, we need to take advantage of all the steps performed previously in the workshop.

  1. the alignment information has to be retrieved

  2. the corresponding annotation needs to be fetched and possibly adapted, e.g. as was done in the preceding lecture

  3. the read coverage per genic feature of interest has to be determined

Can you associate at least one Bioconductor package with each of these tasks?

There are numerous choices; as an example, we will use the following set of packages:

  1. Rsamtools

  2. genomeIntervals

  3. GenomicAlignments

13.1.1 The alignments

In this section we will import the data using the GenomicAlignments package readGAlignments function. This will create a GAlignments object that contains only the reads that aligned to the genome.

Using what was introduced in the alignments section, read in the first BAM file from the extdata() BAM folder. Remember that the protocol used was not strand-specific.

First we scan the bam directory:

    bamfiles <- dir(file.path(extdata(), "BAM"), ".bam$", full=TRUE)
    names(bamfiles) <- sub("_.*", "", basename(bamfiles))

Then we read the first file:

    aln <- readGAlignments(bamfiles[1])
    strand(aln) <- "*"

As we have seen, many of these reads actually align to multiple locations. In a first basic analysis - i.e.: to get a feel for the data - such reads could be ignored.

Filter the multiple alignment reads. Think of the “NH” tag.

    param <- ScanBamParam(tag="NH")
    nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
    aln <- aln[nhs==1]

Now that we have the alignments (the aln object) and the synthetic transcript annotation (the syntheticTrx object) - the one from the lecture, the same used in the Interlude - we can quantify gene expression by counting reads over all exons of a gene and summing them together. One thing to keep in mind is that special care must be taken with reads that overlap more than one feature (e.g. overlapping genes, isoforms), as they might erroneously be counted several times, once per feature. To deal with this we can use any of the approaches summarised in Figure 1:

Overlap modes: Image from the HTSeq package developed by Simon Anders.


The GenomicAlignments package summarizeOverlaps function offers different possibilities to summarize reads per feature:

    load("~/Ptrichocarpa_v3.0_210_synthetic_transcripts.rda")
    annot <- syntheticTrx[syntheticTrx$type=="mRNA",]
    counts1 <- summarizeOverlaps(annot, aln, mode="Union")
    counts2 <- summarizeOverlaps(annot, aln, mode="IntersectionStrict")
    counts3 <- summarizeOverlaps(annot, aln, mode="IntersectionNotEmpty")

Create a data.frame or a matrix from the results above and figure out whether any differences can be observed, e.g. check for differences in the per-row standard deviation (using the apply and sd functions).

    synthTrxCountsTable <- data.frame( 
                                 assays(counts1)$counts,
                                 assays(counts2)$counts,
                                 assays(counts3)$counts)
    colnames(synthTrxCountsTable) <- c("union","intStrict","intNotEmpty")
    rownames(synthTrxCountsTable) <- rownames(counts1)
    sds <- apply(synthTrxCountsTable,1,sd)
    sum(sds!=0)
    sum(sds!=0)/length(sds)
    
    synthTrxCountsTable[which.max(sds),]
    syntheticTrx[which.max(sds),]

So it appears that we have 3,176 cases where the counts differ (8% of the total!), and that the synthetic transcript “Potri.001G244700.0” shows the largest difference.

For a detailed analysis, it would be important to adequately choose one of the intersection modes above; for the remainder of this section, however, we will use the “union” set. As with reads aligning to multiple places in the genome, taking the union when reads overlap several features is a simplification we may not want to make. There are several methods that probabilistically estimate the expression of overlapping features: RSEM (B. Li et al. 2010), cufflinks (C. Trapnell et al. 2010), MMSeq (Turro et al. 2011).

This concludes the section on counting reads per known feature. In the next section, we will look at how novel transcribed regions can be identified.

13.2 Discovering novel transcribed regions

Note: This section is not essential to the workshop

One main advantage of RNA-Seq experiments over microarrays is that they can be used to identify any transcribed molecule, including unknown transcripts and isoforms, as well as other regulatory transcribed elements. To identify such new elements, several methods are available to recreate and annotate transcripts, e.g. Cufflinks (C. Trapnell et al. 2010), Oases (Schulz et al. 2012) and Trinity (Grabherr et al. 2011), to mention a few. We can use Bioconductor tools as well to identify loci and quantify counts without prior annotation knowledge. The example here is very crude and is really just a proof of concept of what one could do in a few commands, i.e.: R rules.

As a start, and to make the results more precise, the reads have been realigned using STAR (Dobin et al. 2013), a very fast and accurate aligner that uses the recent approach of Maximum Exact Matches (MEMs), see this website for more details. This MEM approach - actually a derivative called MMP (Maximum Mappable Prefix) - allows STAR to identify exon-exon junctions without prior knowledge, e.g.: no need for an annotation gff. To start, we re-read one of the sample alignments using the GenomicAlignments package readGAlignments function.

    aln <- readGAlignments(
     BamFile(file.path(extdata(),"BAM","207_subset_sortmerna_trimmomatic_sorted.bam")))

13.2.1 Defining transcribed regions

The process begins with calculating the coverage, using the coverage method from the GenomicAlignments package:

    cover <- coverage(aln)
    cover
    # this object is compressed to save space. It is an RLE (Run Length Encoding)
    # we can look at a region of chromosome 19, between bp 6,553,903 and 6,561,936,
    # which gives us the number of reads overlapping each of those bases
    as.vector(cover[["Chr19"]])[6553903:6561936] 

The coverage shows us how many reads overlap every single base in the genome. It is actually split per chromosome.

The next step is to define “islands” of expression. These can be created using the slice function. The peak height of the islands can be determined with the viewMaxs function, and the island widths can be found using the width function:

    islands <- slice(cover, 1)
    islandPeakHeight <- viewMaxs(islands)
    islandWidth <- width(islands)

While more sophisticated approaches can be used to find exons de novo, we use a simple approach whereby we select islands whose maximum peak height is 2 or more and whose width is 150bp (about 150% of the read size) or more as candidate exons. The elementLengths function would show how many of these candidate exons appear on each chromosome; below we look at those on Chr19:

    candidateExons <- islands[islandPeakHeight >= 2L & islandWidth >=150L]
    candidateExons[["Chr19"]]

Remember that we used an aligner which is capable of mapping reads across splice junctions in the genome.

    sum(cigarOpTable(cigar(aln))[,"N"] > 0)
## [1] 473809

There are 473,809 reads that span exon-exon junctions (EEJs).

Let's look up such a potential EEJ:

    aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",]
## GAlignments object with 13782 alignments and 0 metadata columns:
##           seqnames strand        cigar    qwidth     start       end
##              <Rle>  <Rle>  <character> <integer> <integer> <integer>
##       [1]    Chr19      -   61M311N40M       101     90848     91259
##       [2]    Chr19      + 58M16870N31M        89    119801    136759
##       [3]    Chr19      +   88M997N13M       101    131190    132287
##       [4]    Chr19      -   64M997N37M       101    131214    132311
##       [5]    Chr19      +   31M997N68M        99    131247    132342
##       ...      ...    ...          ...       ...       ...       ...
##   [13778]    Chr19      +   49M433N41M        90  15847572  15848094
##   [13779]    Chr19      - 8S75M881N18M       101  15848054  15849027
##   [13780]    Chr19      -  60M1114N41M       101  15848069  15849283
##   [13781]    Chr19      -   18M256N83M       101  15849314  15849670
##   [13782]    Chr19      -   91M169N10M       101  15849618  15849887
##               width     njunc
##           <integer> <integer>
##       [1]       412         1
##       [2]     16959         1
##       [3]      1098         1
##       [4]      1098         1
##       [5]      1096         1
##       ...       ...       ...
##   [13778]       523         1
##   [13779]       974         1
##   [13780]      1215         1
##   [13781]       357         1
##   [13782]       270         1
##   -------
##   seqinfo: 1446 sequences from an unspecified genome
    aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",][3:6,]
## GAlignments object with 4 alignments and 0 metadata columns:
##       seqnames strand        cigar    qwidth     start       end     width
##          <Rle>  <Rle>  <character> <integer> <integer> <integer> <integer>
##   [1]    Chr19      +   88M997N13M       101    131190    132287      1098
##   [2]    Chr19      -   64M997N37M       101    131214    132311      1098
##   [3]    Chr19      +   31M997N68M        99    131247    132342      1096
##   [4]    Chr19      - 10M997N90M1S       101    131268    132364      1097
##           njunc
##       <integer>
##   [1]         1
##   [2]         1
##   [3]         1
##   [4]         1
##   -------
##   seqinfo: 1446 sequences from an unspecified genome

Here, 4 reads span what looks like an intron of 997 bp. Note that the GenomicAlignments package GAlignments class is aware of splice junctions. Have a look at the coverage of that intron:

    cover[["Chr19"]][131278:132275]
## integer-Rle of length 998 with 8 runs
##   Lengths:   2  13 473   1  89 418   1   1
##   Values :   2   1   0   1   2   0   1   5

Now, let us see whether we can identify a specific motif around the EEJs. We cherry-pick - almost at random really - 2 EEJs, with 7 supporting reads each. Then, using their cigar strings, we calculate the positions of the putative acceptor and donor sites.

    splice.reads <- aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",]
    cherry.pick.sel <- c(13492:13498,13511:13517)
    read.start <- start(splice.reads)[cherry.pick.sel]
    donor.pos <- read.start - 1 +
    as.numeric(sapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M"),"[",1))
     
    acceptor.pos <- read.start - 1 +
     sapply(
       lapply(
         lapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M|N"),"[",1:2),
         as.integer),
       sum)
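To make the coordinate arithmetic above concrete, here is a base-R check for a single read taken from the Chr19 listing earlier (cigar "88M997N13M", alignment start 131190): the donor-side position is the last base of the first matched (M) block, and the acceptor-side position is the last base of the intron (N) block.

```r
# splice-site arithmetic for one read, using base R string manipulation only
read.start <- 131190
cig <- "88M997N13M"

# length of the first M (matched) block
m1 <- as.numeric(sub("M.*", "", cig))
donor.pos <- read.start - 1 + m1

# lengths of the first M block and the N (skipped/intron) block
mn <- as.integer(strsplit(cig, "M|N")[[1]][1:2])
acceptor.pos <- read.start - 1 + sum(mn)

c(donor=donor.pos, acceptor=acceptor.pos)
# donor = 131277, acceptor = 132274
```

Note that 131278 to 132274 is exactly the intron interior probed with the coverage vector just below.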

Then we load the chromosome Chr19 sequence, so that we can determine the sequence in the vicinity of the EEJs.

    Chr19 <- readDNAStringSet(file.path(extdata(),
                                       "FASTA",
                                       "Ptrichocarpa_v3.0_210-Chr19.fa.gz"))

And retrieve the sequences spanning the EEJs. We pad each of them by 10bp on both sides.

    donor <- Views(subject=Chr19[[1]],
                        start=donor.pos-8,
                        end=donor.pos+11)
    acceptor <- Views(subject=Chr19[[1]],
                      start=acceptor.pos-10,
                       end=acceptor.pos+9)

Oops, isn't something missing from our reasoning? Yes, indeed, we have not considered the strand until now! A visual evaluation told us that the first mRNA is probably on the minus strand whereas the second is on the plus strand; we consequently correct the sequences created above.

    donor <- DNAStringSet(donor)
    minus.acceptor <- reverseComplement(donor[1:7])

    acceptor <- DNAStringSet(acceptor)
    minus.donor <- reverseComplement(acceptor[1:7])

    donor[1:7] <- minus.donor
    acceptor[1:7] <- minus.acceptor

Let's see if there's a consensus in the sequences of 20bp centered around the potential acceptor and donor sites.

    library(seqLogo)
    pwm <- makePWM(cbind(
     alphabetByCycle(donor)[c("A","C","G","T"),]/length(donor),
     alphabetByCycle(acceptor)[c("A","C","G","T"),]/length(acceptor))
    )
    seqLogo(pwm)
GC content in aligned reads


Clearly the logo in the figure above is not exceptional, but from only 2 EEJs, we can already see that the donor site at position 10-11 is GT and the acceptor site at position 30-31 is AG, i.e.: the canonical sites. Moreover, we can see a relative - or at least I want to see it because I know it must be there - enrichment for Ts in the intron sequence, a known phenomenon. Hence, using a de-novo approach complemented by additional criteria can prove very efficient.

This concludes the section on summarizing counts. As you will have realized, juggling the different packages for manipulating alignments and annotations requires some coding. To facilitate this, a number of “workflow” packages are available from Bioconductor. The next section gives a brief introduction to easyRNASeq (obviously, a biased selection).

13.3 Using easyRNASeq

Let us redo what was done in the previous section using the easyRNASeq package.

  • load the easyRNASeq package and its companion data package: RnaSeqTutorial.

  • Look up the help page of the simpleRNASeq function, or use show to look up the function definition and arguments.

  • The data package contains 4 BAM files located as indicated below. Create a BamFileList object from them.

  • The data package also contains a gff3 formatted file - given below. Create an AnnotParam object from it.

  • Knowing that the RNA-Sequencing protocol was single-end and unstranded, create the appropriate BamParam object.

  • Finally, combine the AnnotParam and BamParam into a RnaSeqParam object and call the simpleRNASeq function, summarizing the reads by exons.

The BAM files are:

    dir(extdata(),
        pattern="^[A,T].*\\.bam$",
        full.names=TRUE)

The GFF3 file is:

    system.file(
       "extdata",
       "Dmel-mRNA-exon-r5.52.gff3",
       package="RnaSeqTutorial")

And now we can proceed:

    ## we start by loading the packages
    library("easyRNASeq")
    library("RnaSeqTutorial")
    
    ## looking up the function definition
    show(simpleRNASeq)
    
    ## creating the BamFileList
    bamFileList <- getBamFileList(
                    dir(path=system.file("extdata",
                        package="RnaSeqTutorial"),
                        pattern="^[A,T].*\\.bam$",
                        full.names=TRUE))
    
    ## creating the AnnotParam object
    annotParam <- AnnotParam(datasource=system.file(
                           "extdata",
                           "Dmel-mRNA-exon-r5.52.gff3",
                           package="RnaSeqTutorial"))
    
    ## creating the BamParam object
    bamParam <- BamParam(paired=FALSE,stranded=FALSE)
    
    ## creating the RnaSeqParam 
    rnaSeqParam <- RnaSeqParam(annotParam=annotParam,
                              bamParam=bamParam,
                              countBy="exons")
    
    ## and we then get a SummarizedExperiment containing the counts table
    sexp <- simpleRNASeq(
       bamFiles=bamFileList,
       param=rnaSeqParam,
       verbose=TRUE
       )

    ## and look at the counts
    exon.count <- assays(sexp)$exons

    head(exon.count[rowSums(exon.count)>0,])

And that is it. In a few commands - one, really - you have achieved what took us all day to do!

Try to re-run simpleRNASeq with a minimalistic set of arguments to check its parameter detection ability.

    sexp <- simpleRNASeq(
       bamFiles=bamFileList,
       param=RnaSeqParam(annotParam=annotParam),
       verbose=TRUE
       )

13.3.1 Details

The simpleRNASeq function currently accepts the following type of datasource - specified through an AnnotParam object:

  • “biomaRt” use biomaRt to retrieve the annotation

  • “env” use a RangedData or GRanges class object present in the environment

  • “gff” reads in a gff version 3 file

  • “gtf” reads in a gtf file

  • “rda” load an RData object. The object needs to be named and of class RangedData or GRanges.

The reads can be summarized by:

  • exons

  • features (any features such as introns, enhancers, etc.)

  • transcripts

  • genes. Ideally, these are the result of a process similar to the one from the lecture, where a gene is represented by the set of non-overlapping loci (i.e. synthetic exons) representing all the possible exons and UTRs of that gene. Such “synthetic transcripts”, as I call them, are essential when counting reads, as they ensure that no read is counted several times. E.g., a gene can have different isoforms using different, overlapping exons, in which case summarizing by exon might count a read several times, once per overlapping exon. See section 7.1 of the easyRNASeq vignette for details.

The results are returned in a SummarizedExperiment class object.

Have a closer look at the SummarizedExperiment object.

See the GenomicRanges package SummarizedExperiment class for more details on the three accessors used in the following.

    ## the counts
    head(assays(sexp)[[1]])
    ## some non empty counts
    head(assays(sexp)[[1]][rowSums(assays(sexp)[[1]])!=0,])
    ## the sample info
    colData(sexp)
    ## the 'features' info
    rowRanges(sexp)

For more details and a complete overview of the easyRNASeq package capabilities, have a look at the easyRNASeq vignette.

    vignette("easyRNASeq")

13.3.2 Caveats

easyRNASeq is still under active development and as such the current version may still behave unexpectedly.

  1. In addition, advanced checks are conducted on the data provided by the user to make decisions on the overall process. This may need refinement.

  2. The concerns raised by the analysis reported at https://stat.ethz.ch/pipermail/bioc-devel/2012-September/003608.html by Robinson et al. have been addressed. Both the original easyRNASeq method and the GenomicAlignments approach are provided, the latter being the default.

  3. Integration of recent tools for Differential Expression analysis, such as DESeq2 and DEXSeq, is pending. An integration of limma to enable the voom+limma paradigm is planned. Ideally, easyRNASeq would select the most appropriate analysis based on the report by Soneson and Delorenzi (Soneson and Delorenzi 2013).

13.4 Where to from here

After obtaining the count table, numerous downstream analyses are available. Most often, such count tables are generated in a differential expression experimental setup. In that case, packages such as DESeq, DEXSeq, edgeR and limma (see voom+limma in the limma vignette) are some of the possibilities available in Bioconductor. Have a look at (Dillies et al. 2012) and (Soneson and Delorenzi 2013) to decide which tool/approach is best suited for your experimental design. But, of course, counts can be used for other purposes as well, such as visualization, using e.g.: the rtracklayer and GViz packages. Actually, there's no real limitation on what one can achieve with a count table, and it does not need to come from an RNA-Seq experiment; look at the DiffBind package for an example of using ChIP-Seq data for differential binding analyses.

14 Sexual dimorphism study Biological QA

    library(RnaSeqTutorial)

14.1 (Re)Introduction

As a running example, we use a dataset derived from a study performed in Populus tremula (Robinson, Delhomme et al., BMC Plant Biology, 2014). The authors test the evolutionary theory suggesting that males and females may evolve sexually dimorphic phenotypic and biochemical traits concordant with each sex having different optimal strategies of resource investment to maximise reproductive success and fitness. Such sexual dimorphism should result in sex-biased gene expression patterns in non-floral organs for autosomal genes associated with the control and development of such phenotypic traits. Hence, the authors, among other approaches, have looked for gene expression differences between the sexes. This was achieved by an RNA-Seq differential expression analysis between samples grouped by sex. The samples were sequenced on an Illumina HiSeq 2000, using TruSeq adapters, through a 101-cycle paired-end protocol, which yielded between 11 and 34 million reads per sample.

14.2 Processing the data

14.2.1 Reading in the data

First we read the HTSeq files into a matrix. The DESeq2 package actually has a function to ease that process: DESeqDataSetFromHTSeqCount. Here we instead process the samples in parallel using mclapply.

res <- mclapply(dir(file.path(extdata(),"htseq"),
                    pattern="^[2,3].*_STAR\\.txt",
                    full.names=TRUE),function(fil){
  read.delim(fil,header=FALSE,stringsAsFactors=FALSE)
},mc.cores=2)
names(res) <- gsub("_.*_STAR\\.txt","",dir(file.path(extdata(),"htseq"),
                                           pattern="^[2,3].*_STAR\\.txt"))

Then we extract the additional info that HTSeq writes at the end of every file, detailing how many reads aligned uniquely or not, had no or ambiguous overlaps, or were of too low a quality. That last category should not occur at this stage of the analysis, as we sorted and trimmed the data before aligning.

addInfo <- c("__no_feature","__ambiguous",
             "__too_low_aQual","__not_aligned",
             "__alignment_not_unique")

and we then extract the reads

sel <- match(addInfo,res[[1]][,1])
count.table <- do.call(cbind,lapply(res,"[",2))[-sel,]
colnames(count.table) <- names(res)
rownames(count.table) <- res[[1]][,1][-sel]

14.2.2 The HTSeq stat lines

Here we aggregate the information about how many reads aligned, how many were ambiguous, etc.

count.stats <- do.call(cbind,lapply(res,"[",2))[sel,]
colnames(count.stats) <- names(res)
rownames(count.stats) <- sub("__","",addInfo)
count.stats <- rbind(count.stats,aligned=colSums(count.table))
count.stats <- count.stats[rowSums(count.stats) > 0,]

Then we convert them into percentages and check them all:

apply(count.stats,2,function(co){round(co*100/sum(co))})
##                      202 207 213.1 221 226.1 229.1 229 235 236 239 244 303
## no_feature             7   8     7   7     7     7   8   7   7   7   8   7
## ambiguous              4   4     4   4     4     4   4   4   4   3   4   4
## alignment_not_unique   3   3     3   3     3     3   3   3   3   3   3   3
## aligned               86  85    86  87    86    86  86  86  86  86  86  86
##                      305.3 309.1 310.3 337.1 349.2
## no_feature               7     7     7     7     8
## ambiguous                4     4     4     4     4
## alignment_not_unique     3     3     3     3     3
## aligned                 87    86    86    86    85

As can be seen, an average of 86% of the reads aligned uniquely and unambiguously to features. About 3% mapped to multiple locations, 4% overlapped ambiguous features (as the data is not strand-specific, we have used the “Union” counting scheme) and, finally, 7% aligned to no feature. This is not at all unexpected given that the alignments were done against the draft of the P. tremula genome.

14.2.3 Visualizing the HTSeq stats

pal=brewer.pal(6,"Dark2")[1:nrow(count.stats)]
mar <- par("mar")
par(mar=c(7.1,5.1,4.1,2.1))
barplot(as.matrix(count.stats),col=pal,beside=TRUE,las=2,main="read proportion",
        ylim=range(count.stats)+c(0,1e+7))
legend("top",fill=pal,legend=rownames(count.stats),bty="n",cex=0.8,horiz=TRUE)

par(mar=mar)

Here, one can clearly see a library size effect; samples 229.1 and 349.2 have been sequenced deeper, as these are the parents of a large-scale F1 population established for another study. Apart from these, the amount of reads per library seems fairly equally distributed.

14.2.4 Deriving more information

We can estimate what fraction of the genes is not expressed:

sel <- rowSums(count.table) == 0
sprintf("%s percent",round(sum(sel) * 100/ nrow(count.table),digits=1))
## [1] "13.1 percent"
sprintf("of %s genes are not expressed",nrow(count.table))
## [1] "of 35309 genes are not expressed"

So 13.1% of the 35,309 genes are not expressed.

14.3 Biological QA

To assess the validity of the replicates, we can look at different metrics.

14.3.1 Raw count distribution

We start by looking at the per-gene mean expression, i.e. the mean raw count of every gene across samples is calculated and displayed on a log10 scale

plot(density(log10(rowMeans(count.table))),col=pal[1],
     main="mean raw counts distribution",
     xlab="mean raw counts (log10)")

Then the same is done for the individual samples colored by sample type

pal=brewer.pal(8,"Dark2")
plot.multidensity(log10(count.table),col=rep(pal,each=3),
                  legend.x="topright",legend.cex=0.5,
                  main="sample raw counts distribution",
                  xlab="per gene raw counts (log10)")

## NULL

As one can see, most samples show similar trends; a few samples are shifted to the right - those that were sequenced deeper - but this does not affect the global shape of the curve.

14.4 Variance stabilisation for better visualisation

For visualization, the data is submitted to a variance stabilizing transformation using DESeq2. The dispersion is estimated independently of the sample type. We nonetheless define the metadata of importance, i.e. every sample's sex and year of sampling.

samples <- read.csv(file.path(extdata(),"sex-samples.csv"))
conditions <- names(res)
sex <- samples$sex[match(names(res),samples$sample)]
date <- factor(samples$date[match(names(res),samples$sample)])

And then we create the DESeq2 object, using the sample name as condition (which is hence irrelevant to any comparison)

dds <- DESeqDataSetFromMatrix(
  countData = count.table,
  colData = data.frame(condition=conditions,
                       sex=sex,
                       date=date),
  design = ~ condition)

Once this is done, we first check the size factors and as there is no big variation, we decide to go for a variance stabilisation transformation over a regularised log transformation. Check the DESeq2 vignette for more details about this.

dds <- estimateSizeFactors(dds)
sizes <- sizeFactors(dds)
names(sizes) <- names(res)
sizes
##   202   207 213.1   221 226.1 229.1   229   235   236   239   244   303 
##  0.79  1.06  0.88  0.81  0.91  1.89  0.93  1.51  1.09  1.06  0.90  0.99 
## 305.3 309.1 310.3 337.1 349.2 
##  0.65  0.71  1.10  0.80  1.94
boxplot(sizes,main="relative library sizes",ylab="scaling factor")

Let us do the vst

colData(dds)$condition <- factor(colData(dds)$condition,
                                 levels=unique(conditions))
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
vst <- assay(vsd)
colnames(vst) <- colnames(count.table)

The vst introduces an offset, i.e. non-expressed genes will have a value equal to the minimal value of the vst. To avoid lengthy, unnecessary explanations, we simply shift all values so that non-expressed genes have a vst value of 0.

vst <- vst - min(vst)

It is then essential to validate the effect of the vst. To that end, we visualize the corrected mean-sd relationship. It is fairly linear, meaning we can assume homoscedasticity. The slight initial trend/bump is due to genes having few counts in a subset of the samples and hence a higher variability; this is expected.

meanSdPlot(assay(vsd)[rowSums(count.table)>0,])

Here is what would be observed by a log2 transformation of the raw count data

meanSdPlot(log2(as.matrix(count.table[rowSums(count.table)>0,])))
## Warning: Removed 8894 rows containing non-finite values (stat_binhex).

or for the count adjusted for the library size factor

meanSdPlot(log2(counts(dds,normalized=TRUE)[rowSums(count.table)>0,]))
## Warning: Removed 8894 rows containing non-finite values (stat_binhex).

14.4.1 PCA analysis

We perform a Principal Component Analysis (PCA) of the data to do a quick quality assessment, i.e. replicate should cluster and the first dimensions should be explainable by biological means.

pc <- prcomp(t(vst))
percent <- round(summary(pc)$importance[2,]*100)
smpls <- conditions

We color the samples by date and sex

sex.cols<-c("pink","lightblue")
sex.names<-c(F="Female",M="Male")

And then check whether the observed separation is due to the sample sex:

scatterplot3d(pc$x[,1],
              pc$x[,2],
              pc$x[,3],
              xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
              ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
              zlab=paste("Comp. 3 (",percent[3],"%)",sep=""),
              color=sex.cols[as.integer(factor(sex))],
              pch=19)
legend("bottomright",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
       legend=c("Color:",sex.names[levels(factor(sex))]))

par(mar=mar)

This does not seem to be the case at all. There are 2 clusters of points, but both equally contain pink and blue dots. So, we next color by sampling date

dat <- date
scatterplot3d(pc$x[,1],
              pc$x[,2],
              pc$x[,3],
              xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
              ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
              zlab=paste("Comp. 3 (",percent[3],"%)",sep=""),
              color=pal[as.integer(factor(dat))],
              pch=19)
legend("topleft",pch=rep(c(19,23),each=10),col=rep(pal,2),legend=levels(factor(dat)),bty="n")

par(mar=mar)

And here we are… The sampling date is a STRONG CONFOUNDING FACTOR in our analysis. So let's do a final plot with color = sex and shape = date.

scatterplot3d(pc$x[,1],
              pc$x[,2],
              pc$x[,3],
              xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
              ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
              zlab=paste("Comp. 3 (",percent[3],"%)",sep=""),
              color=sex.cols[as.integer(factor(sex))],
              pch=c(19,17)[as.integer(factor(dat))])
legend("bottomright",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
       legend=c("Color:",sex.names[levels(factor(sex))]))
legend("topleft",pch=c(NA,21,24),col=c(NA,1,1),
       legend=c("Symbol:",levels(factor(dat))),cex=0.85)

par(mar=mar)

Sometimes 3D PCA plots are not so easy to read, and we may be just as well off looking at 2 dimensions at a time. First, the first 2 dimensions

plot(pc$x[,1],  
     pc$x[,2],
     xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
     ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
     col=sex.cols[as.integer(factor(sex))],
     pch=c(19,17)[as.integer(factor(dat))],
     main="Principal Component Analysis",sub="variance stabilized counts")
legend("bottomleft",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
       legend=c("Color:",sex.names[levels(factor(sex))]))
legend("topright",pch=c(NA,21,24),col=c(NA,1,1),
       legend=c("Symbol:",levels(factor(dat))),cex=0.85)
text(pc$x[,1],  
     pc$x[,2],
     labels=conditions,cex=.5,adj=-1)

Clearly, the sampling date separates most samples. It is interesting to see that the 2 samples sequenced deeper lie in between the 2 clusters. We could hypothesize that the environmental changes between sampling years affect less important genes, i.e. genes whose expression levels are lower; in other words, we would somewhat be looking at transcriptional “noise”. Since the 2 parent samples have been sequenced much deeper, the fact that they group with neither cluster might be due to them, too, having an increased proportion of transcriptional noise. Looking at the 2nd and 3rd dimensions does not reveal any obvious effects

plot(pc$x[,2],
     pc$x[,3],
     xlab=paste("Comp. 2 (",percent[2],"%)",sep=""),
     ylab=paste("Comp. 3 (",percent[3],"%)",sep=""),
     col=sex.cols[as.integer(factor(sex))],
     pch=c(19,17)[as.integer(factor(dat))],     
     main="Principal Component Analysis",sub="variance stabilized counts")
legend("bottomleft",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
       legend=c("Color:",sex.names[levels(factor(sex))]))
legend("topright",pch=c(NA,21,24),col=c(NA,1,1),
       legend=c("Symbol:",levels(factor(dat))),cex=0.85)

14.4.2 Heatmap

Another way to look at the data is to create a heatmap. As heatmaps are computationally demanding and hard to read if there is too much data, we look only at the 1000 most variable genes

sel <- order(apply(vst,1,sd),decreasing=TRUE)[1:1000]
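The `order(apply(vst,1,sd),...)` idiom ranks the genes by their standard deviation across samples and keeps the top of that ranking. A tiny base-R example (toy matrix, not our data) shows the selection:

```r
# Toy matrix: 3 "genes" with increasing variability across 3 "samples"
m <- rbind(flat = c(5, 5, 5),
           mild = c(4, 5, 6),
           wild = c(1, 50, 100))
# rank the rows by standard deviation, most variable first
sel.toy <- order(apply(m, 1, sd), decreasing = TRUE)
rownames(m)[sel.toy]
# [1] "wild" "mild" "flat"
```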

First with their ID

heatmap.2(vst[sel,],labRow = NA,trace = "none",cexCol = 0.6 )

Then the sex

heatmap.2(vst[sel,],labRow = NA, labCol = sex,trace = "none",cexCol = 0.6 )

And the date

heatmap.2(vst[sel,],labRow = NA, labCol = date,trace = "none",cexCol = 0.6 )

15 Sexual dimorphism study Differential Expression

15.1 Doing the differential expression analysis.

    library(RnaSeqTutorial)
    samples <- read.csv(file.path(extdata(),"sex-samples.csv"))
    res <- mclapply(dir(file.path(extdata(),"htseq"),
                    pattern="^[2,3].*_STAR\\.txt",
                    full.names=TRUE),function(fil){
      read.delim(fil,header=FALSE,stringsAsFactors=FALSE)
    },mc.cores=2)
    names(res) <- gsub("_.*_STAR\\.txt","",dir(file.path(extdata(),"htseq"),
                                           pattern="^[2,3].*_STAR\\.txt"))
    addInfo <- c("__no_feature","__ambiguous",
             "__too_low_aQual","__not_aligned",
             "__alignment_not_unique")
    sel <- match(addInfo,res[[1]][,1])
    count.table <- do.call(cbind,lapply(res,"[",2))[-sel,]
    colnames(count.table) <- names(res)
    rownames(count.table) <- res[[1]][,1][-sel]
    conditions <- names(res)
    sex <- samples$sex[match(names(res),samples$sample)]
    date <- factor(samples$date[match(names(res),samples$sample)])
    pal=brewer.pal(8,"Dark2")
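As a side remark, the `gsub` call above derives the sample names by stripping everything from the first underscore up to the `_STAR.txt` suffix. On a hypothetical file name (for illustration only; the real files sit in the `htseq` directory):

```r
# Hypothetical file name; the pattern removes "_<anything>_STAR.txt",
# keeping only the leading sample identifier
gsub("_.*_STAR\\.txt", "", "202_L001_STAR.txt")
# [1] "202"
```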

The QA revealed that we have a confounding factor. Luckily enough, the grouping of sex is balanced between sampling years, and hence we will be able to block the year effect to investigate the sex effect.

yearSexDesign <- samples[grep("[2,3].*",samples$sample),c("sex","date")]

15.1.1 DESeq

The primary reason to use DESeq over DESeq2, edgeR or any other package is that DESeq is very conservative. The second reason is that DESeq gives less weight to outliers than DESeq2 does. Based on the study by Pakull et al., which identified a candidate gene for sexual determination - the same gene we simultaneously and independently identified in our own analyses - it seems that our sample 226.1 may be mis-classified. However, the sex phenotyping has been done thoroughly, so it may also be that sex determination is a more complex trait and that Potri.019G047300 is only one factor influencing it. As introduced above, we want to look at the sex effect while blocking the year effect. We start by estimating the size factors and the dispersion

cdsFull <- newCountDataSet(count.table,yearSexDesign)
cdsFull = estimateSizeFactors( cdsFull )
cdsFull = estimateDispersions( cdsFull )

We can now check the dispersion estimated by DESeq, which is, as expected, conservative; i.e. most dispersion values lie below the actual dispersion fit by about 1 fold.

plotDispLSD(cdsFull)

Next, we create both models (one considering the date only, and one considering both the date and the sex)

fit1 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date + sex ))
## ...............................
fit0 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date))
## ...............................

For the rest of the analysis, we ignore the genes that did not converge in the previous step

sel <- !is.na(fit1$converged) & !is.na(fit0$converged) & fit1$converged & fit0$converged
fit1 <- fit1[sel,]
fit0 <- fit0[sel,]

We next calculate the GLM p-values and adjust them for multiple testing

pvalsGLM = nbinomGLMTest( fit1, fit0 )
padjGLM = p.adjust( pvalsGLM, method="BH" )
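For reference, the Benjamini-Hochberg ("BH") adjustment used here scales each p-value by the number of tests over its rank, then enforces monotonicity from the largest p-value down. A toy base-R example (made-up p-values):

```r
# 4 toy p-values; with n = 4 tests, BH computes p * n / rank and
# then takes the running minimum from the largest p-value downwards
pv <- c(0.001, 0.01, 0.02, 0.5)
p.adjust(pv, method = "BH")
# 0.004, 0.02, ~0.0267, 0.5
```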

We finally visualize the obtained results. A first look shows that a number of genes have very high fold changes, due to the fact that these genes have very few reads in very few samples. For clarity, in the following plots, we filter out those genes with such extreme fold changes.

boxplot(rowSums(count.table[rownames(fit1[fit1$sexM > 20,]),]>0),
        ylab="number of sample with expression > 0",
        main="distribution of the # of samples with\nexpression > 0 for genes with extreme high log2FC"
)

boxplot(rowSums(count.table[rownames(fit1[fit1$sexM < -20,]),]>0),
        ylab="number of sample with expression > 0",
        main="distribution of the # of samples with\nexpression > 0 for genes with extreme low log2FC"
)

The vast majority of these genes are anyway not significant.

plot(density(padjGLM[fit1$sexM > 20]),
     main="Adj. p-value for genes with extreme log2FC",
     xlab="Adj. p-value")

plot(density(padjGLM[fit1$sexM < -20]),
     main="Adj. p-value for genes with extreme log2FC",
     xlab="Adj. p-value")

So we filter them out

sel <- fit1$sexM > -20 & fit1$sexM < 20
fit1 <- fit1[sel,]
padjGLM <- padjGLM[sel]
pvalsGLM <- pvalsGLM[sel]

A further look into the data reveals that some p-values are equal to 0. As this is inadequate for plotting log odds (the -log10 of the p-value), we set these 0s to be 1 log10 fold change smaller than the otherwise smallest p-value (for plotting purposes only!)

padjGLM[padjGLM==0] <- min(padjGLM[padjGLM!=0])/10
pvalsGLM[pvalsGLM==0] <- min(pvalsGLM[pvalsGLM!=0])/10
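The same replacement logic on a toy vector: the zero becomes one order of magnitude (1 log10 fold) smaller than the smallest non-zero value:

```r
# Toy p-value vector containing a zero
p <- c(0, 1e-8, 0.5)
p[p == 0] <- min(p[p != 0]) / 10
p
# the zero is now 1e-09, one log10 below the smallest non-zero value (1e-08)
```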

15.1.2 Visualize DESeq results

As a volcano plot with the non-adjusted p-values. The red circles are the 4 most significantly differentially expressed genes, and the 8 with the highest absolute log2 fold changes (4 positive and 4 negative)

heatscatter(fit1$sexM,-log10(pvalsGLM),
            main="Male vs. Female Differential Expression",cor=FALSE,
            xlab="Log2 Fold Change", ylab="- log(10) p-value")
legend("topleft",bty="n","1% p-value cutoff",lty=2,col="gray")
points(fit1[pvalsGLM<.01,"sexM"],
       -log10(pvalsGLM[pvalsGLM<.01]),
       col="lightblue",pch=19)
pos <- c(order(pvalsGLM)[1:4],
         order(fit1$sexM,decreasing=TRUE)[1:4],
         order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(pvalsGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")

As a volcano plot with adjusted p-values

heatscatter(fit1$sexM,-log10(padjGLM),
            main="Male vs. Female Differential Expression",cor=FALSE,
            xlab="Log2 Fold Change", ylab="- log(10) adj. p-value")
legend("topleft",bty="n","1% FDR cutoff",lty=2,col="gray")
points(fit1[padjGLM<.01,"sexM"],-log10(padjGLM[padjGLM<.01]),col="lightblue",pch=19)
pos <- c(order(padjGLM)[1:4],order(fit1$sexM,decreasing=TRUE)[1:4],order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(padjGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")

As DESeq is an “older” implementation of this type of analysis, we will redo the analysis using DESeq2. To ultimately be able to compare the obtained set of differential expression candidate genes, we save the list of genes significant at a 1% adjusted p-value cutoff.

UmAspD <- rownames(fit1[padjGLM<.01,])

which are not so many (one)

UmAspD
## [1] "Potra000503g03273.0"

15.1.3 DESeq2 differential expression analyses

We redo the analyses performed above. As you will see, it has been made more intuitive in DESeq2 and the same can be achieved in fewer commands.

dds <- DESeqDataSetFromMatrix(
    countData = count.table,
    colData = data.frame(condition=conditions,
                       sex=sex,
                       date=date),
    design = ~date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
plotDispEsts(dds)

res <- results(dds,contrast=c("sex","M","F"))

In 4 commands we have reproduced the analysis, and we can observe that the dispersion estimation looks different. DESeq2 introduces a so-called “shrinkage”, which you can learn more about in the DESeq2 vignettes. We next extract the results using the same 1% cutoff and plot similar validation plots

alpha=0.01
plotMA(res,alpha=alpha)

volcanoPlot(res,alpha=alpha)

## NULL

The volcano plot looks similar to what we observed previously (except that we have, by mistake, inverted the ratio calculation; we now did F-M), although it is slightly less dense than in DESeq. We also do not get any genes with a p-value of 0

hist(res$padj,breaks=seq(0,1,.01))

sum(res$padj<alpha,na.rm=TRUE)
## [1] 3

3 genes are candidates for differential expression at a 1% adjusted p-value cutoff, which we also save for later comparison

UmAspD2 <- rownames(res[order(res$padj,na.last=TRUE),][1:sum(res$padj<alpha,na.rm=TRUE),])
UmAspD2
## [1] "Potra003910g23469.0" "Potra001414g11999.0" "Potra004533g25038.0"

We can already observe that one gene is common with the previous list, but the chromosome 19 candidate has disappeared. Given the Pakull et al. publication, it is surprising that Potri.019G047300.0 (the homolog of Potra000503g03273) no longer comes out as a differentially expressed candidate gene at a 1% FDR cutoff with DESeq2. However, its adjusted p-value is 2.1% - see below. Moreover, its Cook’s distance - a measure of the homogeneity of the gene expression within a condition - is relatively high (2x the average) and might indicate the presence of an outlier. Looking in more detail at the count values and sex assignment shows that sample 226.1 behaves like a male sample with regard to that gene, so either the sample has been mis-sexed or the Pakull et al. results in P. tremuloides and P. tremula give only a partial view of the true biology.

as.data.frame(res)["Potra000503g03273.0",]
##                     baseMean log2FoldChange lfcSE stat  pvalue padj
## Potra000503g03273.0      125            1.1  0.24  4.6 3.9e-06 0.02
data.frame(
  counts=counts(dds,normalized=TRUE)["Potra000503g03273.0",],
  sex,
  date,
  conditions)
##       counts sex date conditions
## 202     7.64   F 2008        202
## 207   145.27   M 2010        207
## 213.1   0.00   F 2008      213.1
## 221   261.99   M 2008        221
## 226.1  23.02   F 2010      226.1
## 229.1 305.21   M 2008      229.1
## 229    79.92   M 2010        229
## 235    91.24   M 2010        235
## 236     0.92   F 2010        236
## 239     0.00   F 2010        239
## 244     2.22   F 2010        244
## 303   476.78   M 2008        303
## 305.3  87.38   F 2008      305.3
## 309.1 237.93   M 2008      309.1
## 310.3  49.12   F 2008      310.3
## 337.1 349.03   M 2008      337.1
## 349.2   5.15   F 2008      349.2
mcols(dds)[names(rowRanges(dds))=="Potra000503g03273.0",]
## DataFrame with 1 row and 36 columns
##    baseMean   baseVar   allZero dispGeneEst   dispFit dispersion  dispIter
##   <numeric> <numeric> <logical>   <numeric> <numeric>  <numeric> <numeric>
## 1       125     21809     FALSE         1.3      0.13       0.99        10
##   dispOutlier   dispMAP Intercept  date2008  date2010      sexF      sexM
##     <logical> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1       FALSE      0.99       6.2      0.64     -0.64     -0.56      0.56
##   SE_Intercept SE_date2008 SE_date2010   SE_sexF   SE_sexM MLE_Intercept
##      <numeric>   <numeric>   <numeric> <numeric> <numeric>     <numeric>
## 1         0.35        0.25        0.25      0.12      0.12           4.8
##   MLE_date_2010_vs_2008 MLE_sex_M_vs_F WaldStatistic_Intercept
##               <numeric>      <numeric>               <numeric>
## 1                  -1.9            3.7                      18
##   WaldStatistic_date2008 WaldStatistic_date2010 WaldStatistic_sexF
##                <numeric>              <numeric>          <numeric>
## 1                    2.6                   -2.6               -4.6
##   WaldStatistic_sexM WaldPvalue_Intercept WaldPvalue_date2008
##            <numeric>            <numeric>           <numeric>
## 1                4.6              4.7e-69              0.0099
##   WaldPvalue_date2010 WaldPvalue_sexF WaldPvalue_sexM  betaConv  betaIter
##             <numeric>       <numeric>       <numeric> <logical> <numeric>
## 1              0.0099         3.9e-06         3.9e-06      TRUE        18
##    deviance  maxCooks
##   <numeric> <numeric>
## 1       183      0.77
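Cook’s distance, mentioned above, quantifies how strongly a single observation influences a fitted model. A base-R toy example (an ordinary linear model on made-up numbers, not the DESeq2 negative binomial GLM) shows how an aberrant observation stands out:

```r
# Toy linear model: 5 points on a line, with the 5th one shifted far off
x <- 1:5
y <- c(1, 2, 3, 4, 20)
cd <- cooks.distance(lm(y ~ x))
# the shifted observation dominates the influence measure
which.max(cd)
```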

15.1.4 DESeq2 with the sample re-sexed

Nevertheless, while waiting for the next time the tree flowers so that we can confirm its sex with certainty, we can assume that the sample was mis-sexed and see what effect a sex correction would have. Given the data from Pakull et al. about that gene, and assuming that sample 226.1 was mis-sexed, we redo the DESeq2 analysis after swapping the sex of that sample

sex[conditions=="226.1"] <- "M"
dds <- DESeqDataSetFromMatrix(
  countData = count.table,
  colData = data.frame(condition=conditions,
                       sex=sex,
                       date=date),
  design = ~ date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res.swap <- results(dds,contrast=c("sex","M","F"))
plotMA(res.swap,alpha=alpha)

volcanoPlot(res.swap,alpha=alpha)

## NULL
hist(res.swap$padj,breaks=seq(0,1,.01))

sum(res.swap$padj<alpha,na.rm=TRUE)
## [1] 7
UmAspD2.swap <- rownames(res.swap[order(res.swap$padj,na.last=TRUE),][1:sum(res.swap$padj<alpha,na.rm=TRUE),])
UmAspD2.swap
## [1] "Potra000503g03273.0" "Potra001218g10512.0" "Potra001519g12651.0"
## [4] "Potra003910g23469.0" "Potra005027g34078.0" "Potra181824g28189.0"
## [7] "Potra002176g16783.0"

Fair enough: the gene made it back into the list.

15.1.5 DESeq2 analysis - sample removed

But still, to assess the importance of the 226.1 sample, we redo the DESeq2 DE analysis without it (since we have enough replicates for the removal not to affect the power of our analysis)

sel <- conditions != "226.1"
dds <- DESeqDataSetFromMatrix(
  countData = count.table[,sel],
  colData = data.frame(condition=conditions[sel],
                       sex=sex[sel],
                       date=date[sel]),
  design = ~ date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
plotDispEsts(dds)

res.sel <- results(dds,contrast=c("sex","M","F"))
plotMA(res.sel,alpha=alpha)

volcanoPlot(res.sel,alpha=alpha)

## NULL
hist(res.sel$padj,breaks=seq(0,1,.01))

sum(res.sel$padj<alpha,na.rm=TRUE)
## [1] 8
UmAspD2.sel <- rownames(res.sel[order(res.sel$padj,na.last=TRUE),][1:sum(res.sel$padj<alpha,na.rm=TRUE),])
UmAspD2.sel
## [1] "Potra003910g23469.0" "Potra005027g34078.0" "Potra000503g03273.0"
## [4] "Potra001414g11999.0" "Potra001519g12651.0" "Potra004533g25038.0"
## [7] "Potra001218g10512.0" "Potra178727g27972.0"

15.2 Results comparison

We compare the results from the 4 approaches, the easiest way being to plot a Venn diagram

plot.new()
grid.draw(venn.diagram(list(
  UmAsp=UmAspD2,
  UmAsp.swap=UmAspD2.swap,
  UmAsp.sel=UmAspD2.sel,
  UmAspD = UmAspD),
                       filename=NULL,
                       col=pal[1:4],
                       category.names=c("UmAsp - DESeq2",
                                        "UmAsp - corrected",
                                        "UmAsp - removed",
                                        "UmAsp - DESeq")))

Unlike in the original analysis conducted in the study - where the Populus trichocarpa genome was used as a reference (the P. tremula genome assembly is still a draft), where the gene Potri.014G155300.0 was consistently found by all 4 approaches, and where the gene Potri.019G047300.0 (the Potra000503g03273.0 homolog) was found by both DESeq and DESeq2 - here there is no overlap between the two unmodified analyses. This is probably because the results in P. tremula are obscured by an assembly artefact. Indeed, we know that the two genes in P. trichocarpa have a very high sequence similarity (being putatively very recent paralogs), which appears to have been collapsed into a single locus in P. tremula. We can also note that “messing” with the sample sex attribute, or removing that sample, leads to the identification of several more genes! It is unclear whether these are the result of a technical artefact or whether they would be worth investigating further.

sort(intersect(UmAspD2,UmAspD))
## character(0)
sort(intersect(UmAspD2.swap,UmAspD))
## [1] "Potra000503g03273.0"
sort(intersect(UmAspD2.sel,UmAspD))
## [1] "Potra000503g03273.0"

15.3 Appendix

These are the analyses performed when asking Mike Love (the DESeq2 developer) why DESeq2 seems so sensitive to misclassification. The short answer was to use the Cook’s distance to investigate that further, and that the 1% FDR cutoff was conservative. They are commented out on purpose. Note that the rowData function has been replaced by rowRanges in Bioconductor version 3.1 and above.

## mis-classified
# sex <- samples$sex[match(colnames(count.table),samples$sample)]
# date <- factor(samples$date[match(colnames(count.table),samples$sample)])
# ddsM <- DESeqDataSetFromMatrix(
#   countData = count.table,
#   colData = data.frame(condition=conditions,
#                        sex=sex,
#                        date=date),
#   design = ~ date+sex)
# ddsM <- DESeq(ddsM)
# resM <- results(ddsM,contrast=c("sex","F","M"))
# 
# counts(ddsM,normalized=TRUE)["Potri.019G047300.0",]
# resM["Potri.019G047300.0",]
# mcols(ddsM)[names(rowData(ddsM)) == "Potri.019G047300.0",]
# 
# ## reclassified
# sexR <- sex
# sexR[5] <- "M"
# ddsR <- DESeqDataSetFromMatrix(
#   countData = count.table,
#   colData = data.frame(condition=conditions,
#                        sex=sexR,
#                        date=date),
#   design = ~ date+sex)
# ddsR <- DESeq(ddsR)
# resR <- results(ddsR,contrast=c("sex","F","M"))
# 
# counts(ddsR,normalized=TRUE)["Potri.019G047300.0",]
# resR["Potri.019G047300.0",]
# mcols(ddsR)[names(rowData(ddsR)) == "Potri.019G047300.0",]
# 
# ## samples details
# sex
# date
# 
# ## the model should not be affected by the reclassification
# lapply(split(sex,date),table)
# lapply(split(sexR,date),table)
# 

16 Session Info

This concludes the differential expression analysis. The following details the R version that was used to perform the analyses.

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12 (Sierra)
## 
## locale:
## [1] C/UTF-8/C/C/C/C
## 
## attached base packages:
##  [1] grid      stats4    parallel  methods   stats     graphics  grDevices
##  [8] utils     datasets  base     
## 
## other attached packages:
##  [1] hexbin_1.27.1                            
##  [2] RnaSeqTutorial_0.99.4                    
##  [3] VennDiagram_1.6.17                       
##  [4] futile.logger_1.4.3                      
##  [5] vsn_3.42.3                               
##  [6] scatterplot3d_0.3-37                     
##  [7] RColorBrewer_1.1-2                       
##  [8] LSD_3.0                                  
##  [9] gplots_3.0.1                             
## [10] DESeq2_1.14.0                            
## [11] DESeq_1.26.0                             
## [12] locfit_1.5-9.1                           
## [13] BSgenome.Dmelanogaster.UCSC.dm3_1.4.0    
## [14] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
## [15] ShortRead_1.32.0                         
## [16] BiocParallel_1.8.1                       
## [17] seqLogo_1.40.0                           
## [18] leeBamViews_1.10.0                       
## [19] BSgenome_1.42.0                          
## [20] rtracklayer_1.34.0                       
## [21] GenomicFeatures_1.26.0                   
## [22] AnnotationDbi_1.36.0                     
## [23] GenomicAlignments_1.10.0                 
## [24] Rsamtools_1.26.1                         
## [25] Biostrings_2.42.0                        
## [26] XVector_0.14.0                           
## [27] SummarizedExperiment_1.4.0               
## [28] Biobase_2.34.0                           
## [29] GenomicRanges_1.26.1                     
## [30] GenomeInfoDb_1.10.0                      
## [31] IRanges_2.8.0                            
## [32] S4Vectors_0.12.0                         
## [33] genomeIntervals_1.30.0                   
## [34] intervals_0.15.1                         
## [35] easyRNASeq_2.10.0                        
## [36] BiocGenerics_0.20.0                      
## [37] lattice_0.20-34                          
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6          tools_3.3.1           affyio_1.44.0        
##  [4] rpart_4.1-10          KernSmooth_2.23-15    Hmisc_3.17-4         
##  [7] DBI_0.5-1             colorspace_1.2-7      nnet_7.3-12          
## [10] gridExtra_2.2.1       preprocessCore_1.36.0 chron_2.3-47         
## [13] formatR_1.4           labeling_0.3          caTools_1.17.1       
## [16] scales_0.4.0          genefilter_1.56.0     affy_1.52.0          
## [19] stringr_1.1.0         digest_0.6.10         foreign_0.8-67       
## [22] rmarkdown_1.1         htmltools_0.3.5       limma_3.30.2         
## [25] RSQLite_1.0.0         BiocInstaller_1.24.0  hwriter_1.3.2        
## [28] gtools_3.5.0          acepack_1.4.1         RCurl_1.95-4.8       
## [31] magrittr_1.5          Formula_1.2-1         Matrix_1.2-7.1       
## [34] Rcpp_0.12.7           munsell_0.4.3         stringi_1.1.2        
## [37] yaml_2.1.13           edgeR_3.16.1          zlibbioc_1.20.0      
## [40] plyr_1.8.4            gdata_2.17.0          splines_3.3.1        
## [43] annotate_1.52.0       knitr_1.14            geneplotter_1.52.0   
## [46] biomaRt_2.30.0        futile.options_1.0.0  XML_3.98-1.4         
## [49] evaluate_0.10         latticeExtra_0.6-28   lambda.r_1.1.9       
## [52] data.table_1.9.6      gtable_0.2.0          assertthat_0.1       
## [55] ggplot2_2.1.0         xtable_1.8-2          survival_2.40-1      
## [58] tibble_1.2            cluster_2.0.5

17 Acknowledgments

  1. Thanks to the Workshop organizer (Ari Löytinoja).

  2. Thanks to Martin Morgan (R and Bioconductor guru among other things, not to say head of the Bioconductor core at FHCRC - he never ceases to amaze us) for the original version of most of the material in your hands, which we used during the EMBO October 2012 workshop. Time flies.

  3. Thanks to Ângela Gonçalves for the original material of the section on estimating expression.

  4. Thanks to all the lecturers; it is always fun around you.

  5. Finally, thanks to you, the reader - whatever the support you’re reading this on - for having made it this far.

References

Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11: R106. doi:10.1186/gb-2010-11-10-r106.

Anders, Simon, Paul Pyl, and Wolfgang Huber. 2014. “HTSeq–A Python Framework to Work with High-Throughput Sequencing Data.” BioRxiv.

Bolger, Anthony M, Marc Lohse, and Bjoern Usadel. 2014. “Trimmomatic: A Flexible Trimmer for Illumina Sequence Data.” Bioinformatics 30 (15): 2114–20. doi:10.1093/bioinformatics/btu170.

Bray, Nicolas L, Harold Pimentel, Pall Melsted, and Lior Pachter. 2016. “Near-Optimal Probabilistic RNA-Seq Quantification.” Nat Biotech 34 (5): 525–27. doi:10.1038/nbt.3519.

Burrows, Michael, and David J Wheeler. 1994. “A Block-Sorting Lossless Data Compression Algorithm.” Technical Report 124. Palo Alto, CA: Digital Equipment Corporation.

Chambers, John M. 2008. Software for Data Analysis: Programming with R. New York: Springer. http://stat.stanford.edu/~jmc4/Rbook/.

Cock, Peter J A, Christopher J Fields, Naohisa Goto, Michael L Heuer, and Peter M Rice. 2009. “The Sanger Fastq File Format for Sequences with Quality Scores, and the Solexa/Illumina Fastq Variants.” Nucleic Acids Research, December. doi:10.1093/nar/gkp1137.

Dalgaard, Peter. 2008. Introductory Statistics with R. 2nd ed. Springer. http://www.biostat.ku.dk/~pd/ISwR.html.

Delhomme, Nicolas, Ismaël Padioleau, Eileen E. Furlong, and Lars M. Steinmetz. 2012. “EasyRNASeq: A Bioconductor Package for Processing Rna-Seq Data.” Bioinformatics 28 (19): 2532–33. doi:10.1093/bioinformatics/bts477.

Dillies, Marie-Agnès, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier, Marine Jeanmougin, Nicolas Servant, Céline Keime, et al. 2012. “A Comprehensive Evaluation of Normalization Methods for Illumina High-Throughput Rna Sequencing Data Analysis.” Brief Bioinformatics, September. doi:10.1093/bib/bbs046.

Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: Ultrafast Universal Rna-Seq Aligner.” Bioinformatics 29 (1): 15–21. doi:10.1093/bioinformatics/bts635.

Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21 (16): 3439–40. doi:10.1093/bioinformatics/bti525.

Gentleman, Robert. 2008. R Programming for Bioinformatics. Computer Science & Data Analysis. Boca Raton, FL: Chapman & Hall/CRC. http://www.bioconductor.org/pub/RBioinf/.

Glaus, Peter, Antti Honkela, and Magnus Rattray. 2012. “Identifying Differentially Expressed Transcripts from Rna-Seq Data with Biological Variation.” Bioinformatics 28 (13): 1721–8. doi:10.1093/bioinformatics/bts260.

Grabherr, Manfred G, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, et al. 2011. “Full-Length Transcriptome Assembly from Rna-Seq Data Without a Reference Genome.” Nat Biotechnol 29 (7): 644–52. doi:10.1038/nbt.1883.

Holmes, Ian, Keith Harris, and Christopher Quince. 2012. “Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics.” PLoS ONE 7 (2): e30126. doi:10.1371/journal.pone.0030126.

Kabacoff, Rob. 2010. R in Action. Manning. http://www.manning.com/kabacoff.

Kasprzyk, Arek. 2011. “BioMart: Driving a Paradigm Change in Biological Data Management.” Database (Oxford) 2011 (January): bar049. doi:10.1093/database/bar049.

Khan, Zia, Joshua S Bloom, Leonid Kruglyak, and Mona Singh. 2009. “A Practical Algorithm for Finding Maximal Exact Matches in Large Sequence Datasets Using Sparse Suffix Arrays.” Bioinformatics 25 (13): 1609–16. doi:10.1093/bioinformatics/btp275.

Kopylova, Evguenia, Laurent Noé, and Hélène Touzet. 2012. “SortMeRNA: Fast and Accurate Filtering of Ribosomal Rnas in Metatranscriptomic Data.” Bioinformatics 28 (24): 3211–7. doi:10.1093/bioinformatics/bts611.

Langmead, B., C. Trapnell, M. Pop, and S. L. Salzberg. 2009. “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.” Genome Biol. 10: R25.

Leśniewska, Anna, and Michał J Okoniewski. 2011. “RnaSeqMap: A Bioconductor Package for Rna Sequencing Data Exploration.” BMC Bioinformatics 12 (January): 200. doi:10.1186/1471-2105-12-200.

Li, Bo, Victor Ruotti, Ron M Stewart, James A Thomson, and Colin N Dewey. 2010. “RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty.” Bioinformatics 26 (4): 493–500. doi:10.1093/bioinformatics/btp692.

Li, H., and R. Durbin. 2009. “Fast and accurate short read alignment with Burrows-Wheeler transform.” Bioinformatics 25 (July): 1754–60.

———. 2010. “Fast and accurate long-read alignment with Burrows-Wheeler transform.” Bioinformatics 26 (March): 589–95.

Li, Heng, and Nils Homer. 2010. “A Survey of Sequence Alignment Algorithms for Next-Generation Sequencing.” Briefings in Bioinformatics 11 (5): 473–83. doi:10.1093/bib/bbq015.

Matloff, N. 2011. The Art of R Programming. No Starch Pess.

Meyer, Laurence R, Ann S Zweig, Angie S Hinrichs, Donna Karolchik, Robert M Kuhn, Matthew Wong, Cricket A Sloan, et al. 2013. “The Ucsc Genome Browser Database: Extensions and Updates 2013.” Nucleic Acids Res 41 (Database issue): D64–9. doi:10.1093/nar/gks1048.

Mortazavi, Ali, and others. 2008. “Mapping and Quantifying Mammalian Transcriptomes by Rna-Seq.” Nature Methods 5 (7): 621–8.

Robinson, Kathryn, Nicolas Delhomme, Niklas Mähler, Bastian Schiffthaler, Jenny Önskog, Benedicte Albrectsen, Pär Ingvarsson, Torgeir Hvidsten, Stefan Jansson, and Nathaniel Street. 2014. “Populus Tremula (European Aspen) Shows No Evidence of Sexual Dimorphism.” BMC Plant Biology 14 (1): 276–76. doi:10.1186/s12870-014-0276-5.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (January): 139–40.

Schulz, Marcel H, Daniel R Zerbino, Martin Vingron, and Ewan Birney. 2012. “Oases: Robust de Novo Rna-Seq Assembly Across the Dynamic Range of Expression Levels.” Bioinformatics 28 (8): 1086–92. doi:10.1093/bioinformatics/bts094.

Soneson, Charlotte, and Mauro Delorenzi. 2013. “A Comparison of Methods for Differential Expression Analysis of Rna-Seq Data.” BMC Bioinformatics 14: 91. doi:10.1186/1471-2105-14-91.

Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. 2010. “Transcript Assembly and Quantification by Rna-Seq Reveals Unannotated Transcripts and Isoform Switching During Cell Differentiation.” Nat Biotechnol 28 (5): 511–5. doi:10.1038/nbt.1621.

Turro, Ernest, Shu-Yi Su, Ângela Gonçalves, Lachlan J M Coin, Sylvia Richardson, and Alex Lewin. 2011. “Haplotype and Isoform Specific Expression Estimation Using Multi-Mapping Rna-Seq Reads.” Genome Biol 12 (2): R13. doi:10.1186/gb-2011-12-2-r13.

Wu, Thomas D, and Colin K Watanabe. 2005. “GMAP: A Genomic Mapping and Alignment Program for MRNA and Est Sequences.” Bioinformatics 21 (9): 1859–75. doi:10.1093/bioinformatics/bti310.


  1. Do not worry if some of this is illegible to you at present, it should be clear by the end of the workshop if we do our job

  2. See http://www.epigenesys.eu/images/stories/protocols/pdf/20150303161357_p67.pdf

  3. The authors want to thank Martin Morgan for the original material of the present chapter

  4. The alignment rate depends on the genome quality and completeness and can hence have a large range - the values presented here are from the Norway Spruce, a version 1 draft of the genome.

  5. The values presented here report only uniquely aligning reads. In our example, the rate of non aligning reads is usually equal to the rate of multi-mapping reads, i.e. about 10% for both in the worst cases.

  6. The author wants to thank Ângela Gonçalves for parts of the present chapter